# ProText Analyzer
## Note

NLTK was not used for every task in this project. Instead, it relies on:

- TextBlob for sentiment analysis
- spaCy for various text processing tasks
- Syllapy for counting syllables in words
## Project Structure

- `cleaned_articles/`: Cleaned articles ready for analysis.
- `extracted_articles/`: Raw articles extracted for the project.
- `master_dictionary/`: Word lists for sentiment analysis.
  - `cleaned_negative_words.txt`: List of cleaned negative words.
  - `cleaned_positive_words.txt`: List of cleaned positive words.
  - `negative-words.txt`: Raw negative words for sentiment analysis.
  - `positive-words.txt`: Raw positive words for sentiment analysis.
- `project_introduction/`: Overview and objectives of the project.
- `test_assessment/`: Test assignments and notebooks.
  - `dataextraction.ipynb`: Jupyter notebook for data extraction tasks.
  - `testassessment.ipynb`: Jupyter notebook for additional test assessments.
- `testassignment/`: Code and markdown files related to the assignment.
  - `Code + Markdown/`: Code snippets and explanations.
  - `Run All/`: Script to execute all code cells in the notebooks.
- `Stop Words/`: Stop-word files used for preprocessing.
- `text_analysis/`: Files for performing text analysis.
  - `textanalysis.ipynb`: Jupyter notebook for text analysis.
  - `sentiment_analysis.log`: Log file for sentiment analysis runs.
  - `textblob_sentiment_result.csv`: CSV file with sentiment analysis results.
- `additional_files/`: Summary results and metrics.
  - `analysis_results.csv`: Various analysis results.
  - `final_text_analysis_results.xlsx`: Final compiled analysis results.
## Blackcoffer Test Assignment

- Consulting Website: Blackcoffer | LSA Lead
- Web App Products: Netclan | Insights | Hire Kingdom | Workcroft
- Mobile App Products: Netclan | Bwstory
## Assignment Overview

- Objective: extract textual data from the provided URLs and perform text analysis.
- Data extraction:
  - Input URLs come from `Input.xlsx`.
  - Tools: Python, BeautifulSoup, Selenium, Scrapy.
- Data analysis:
  - Output in CSV or Excel format.
  - Variables include Positive Score, Negative Score, Polarity Score, and more.
- Timeline: 6 days.
- Submission: via Google Form with the required files.
## Methodology

- Sentiment analysis: clean the text using stop-word lists, build dictionaries of positive and negative words, and derive the sentiment variables from them (a sketch follows this list).
- Readability analysis: compute average sentence length, percentage of complex words, and the Gunning Fog Index.
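The following is a minimal sketch of the dictionary-based scoring described above. The stop-word file name is a hypothetical placeholder (the `Stop Words` directory holds several lists), and the polarity and subjectivity formulas are the ones commonly used for this assignment, not taken verbatim from the repository.

```python
import re

def load_words(path):
    """Read one word per line into a lowercase set."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Paths follow the project structure above; the stop-word file name
# is an assumed example, as the directory contains several lists.
stop_words = load_words("Stop Words/stop_words.txt")
positive = load_words("master_dictionary/cleaned_positive_words.txt")
negative = load_words("master_dictionary/cleaned_negative_words.txt")

def sentiment_scores(text):
    # Tokenize, lowercase, and drop stop words before scoring.
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in stop_words]
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    # Assumed formulas (the usual definitions for this assignment):
    polarity = (pos - neg) / (pos + neg + 1e-6)
    subjectivity = (pos + neg) / (len(tokens) + 1e-6)
    return pos, neg, polarity, subjectivity
```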
## Objective

The ProText-Analyzer project extracts article content from the provided URLs and performs text analysis tasks such as sentiment scoring and readability measurement. The results are structured in a clean, organized format, ready for review and further use.
## Project Overview

The goal of ProText-Analyzer is to:

- Extract textual data: fetch article content from the URLs provided in the `Input.xlsx` file.
- Perform textual analysis: calculate the following metrics:
  - Sentiment scores (positive, negative, polarity, subjectivity)
  - Readability scores (Fog Index, average sentence length)
  - Word count, syllable count, and other word statistics
## Technologies Used

- Python
- Libraries:
  - `TextBlob` for sentiment analysis
  - `spaCy` for text processing tasks (tokenization, POS tagging, etc.)
  - `syllapy` for syllable counting
  - `BeautifulSoup` for HTML parsing during data extraction
  - `requests` for handling HTTP requests
  - `pandas` for data management
- Excel/CSV for input/output handling
## Installation

- Clone the repository to your local machine:

  ```bash
  git clone https://github.com/rubydamodar/ProText-Analyzer.git
  cd ProText-Analyzer
  ```

- Install the required Python libraries:

  ```bash
  pip install -r requirements.txt
  ```
## Data Extraction

ProText-Analyzer extracts the article title and body from each URL listed in the `Input.xlsx` file and stores the text for further analysis.

Process overview:

- Read the input file: load the URLs and their associated IDs from `Input.xlsx`.
- Extract the article content:
  - Fetch the HTML with `requests`.
  - Parse the HTML with `BeautifulSoup` to extract the article's title and body.
- Save the extracted content into text files named after the `URL_ID`.

File management:

- Each article's content is saved to its own text file, keeping the data clean for further analysis.
- Error handling covers file I/O and network failures.

A sketch of this step is shown below.
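Here is an illustrative sketch of the extraction loop. It assumes `Input.xlsx` has `URL_ID` and `URL` columns and that titles live in `<h1>` tags with body text in `<p>` tags; the selectors in the repository may differ.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.read_excel("Input.xlsx")  # assumed columns: URL_ID, URL

for _, row in df.iterrows():
    url_id, url = row["URL_ID"], row["URL"]
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"Skipping {url_id}: {exc}")  # network errors are not fatal
        continue
    soup = BeautifulSoup(resp.text, "html.parser")
    # Assumed selectors: title in <h1>, body in <p> tags.
    title = soup.find("h1")
    title_text = title.get_text(strip=True) if title else ""
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    with open(f"extracted_articles/{url_id}.txt", "w", encoding="utf-8") as f:
        f.write(title_text + "\n" + body)
```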
## Text Analysis Process

The extracted text goes through several analysis steps to compute the following variables (a sketch follows this list):

- Sentiment analysis:
  - Implemented with `TextBlob` to compute the Positive, Negative, Polarity, and Subjectivity scores.
  - Text is first cleaned by removing stop words and irrelevant characters.
- Readability analysis:
  - Based on the Gunning Fog Index.
  - Additional metrics: average sentence length and percentage of complex words.
- Word-level metrics:
  - Word count, complex word count, syllables per word (via `syllapy`), personal pronoun count (via regex), and average word length.
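A hedged sketch of these computations follows. `analyze` is an illustrative helper name rather than the project's actual API, and the Fog Index line uses the classic Gunning formula (the assignment may express the complex-word term as a plain fraction instead of a percentage).

```python
import re

import syllapy
from textblob import TextBlob

def analyze(text):
    blob = TextBlob(text)
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Complex words: three or more syllables, per the Gunning Fog convention.
    complex_words = [w for w in words if syllapy.count(w) >= 3]

    avg_sentence_len = len(words) / max(len(sentences), 1)
    pct_complex = len(complex_words) / max(len(words), 1)
    fog_index = 0.4 * (avg_sentence_len + 100 * pct_complex)

    return {
        "Polarity Score": blob.sentiment.polarity,
        "Subjectivity Score": blob.sentiment.subjectivity,
        "Average Sentence Length": avg_sentence_len,
        "Complex Word Count": len(complex_words),
        "Word Count": len(words),
        "Syllable Count": sum(syllapy.count(w) for w in words),
        "Average Word Length": sum(map(len, words)) / max(len(words), 1),
        "Fog Index": fog_index,
    }
```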
## Output Structure

The results are saved in Excel/CSV format following the structure outlined in `Output Data Structure.xlsx` (a sketch of this step follows the list). The included variables are:
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Average Sentence Length
- Complex Word Count
- Word Count
- Syllable Count
- Personal Pronouns Count
- Average Word Length
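As a sketch of the output step, reusing the hypothetical `analyze` helper from the previous section and the file names listed in the project structure:

```python
from pathlib import Path

import pandas as pd

rows = []
for path in Path("cleaned_articles").glob("*.txt"):
    metrics = analyze(path.read_text(encoding="utf-8"))
    metrics["URL_ID"] = path.stem  # files are named after the URL_ID
    rows.append(metrics)

df = pd.DataFrame(rows)
df.to_csv("analysis_results.csv", index=False)
df.to_excel("final_text_analysis_results.xlsx", index=False)  # needs openpyxl
```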
## How to Run

- Data extraction: run the script to extract article data from the URLs:

  ```bash
  python data_extraction.py
  ```

- Text analysis: run the text analysis notebook (`textanalysis.ipynb`) to process the extracted articles.

The results are saved in the output directory in `.csv` or `.xlsx` format.
## Challenges and Solutions

- Error handling: robust error handling manages network and file-related failures.
- Text processing: `spaCy` provides precise tokenization and POS tagging, and `syllapy` handles syllable counting.
- Personal pronouns: a regex captures pronouns such as "us" without mistakenly counting the country abbreviation "US" (see the sketch below).
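One way to implement that distinction, shown here as an assumed approach rather than the repository's exact code, is to match case-insensitively and then discard the all-caps form. The pronoun list (I, we, my, ours, us) is the one commonly used for this assignment.

```python
import re

PRONOUN_RE = re.compile(r"\b(I|we|my|ours|us)\b", re.IGNORECASE)

def count_personal_pronouns(text):
    matches = PRONOUN_RE.findall(text)
    # Keep "us"/"Us" but drop the country abbreviation "US".
    return sum(1 for m in matches if m != "US")
```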
## Contributing
We welcome contributions to enhance ProText-Analyzer! To contribute:
- Fork the repository.
- Create a new branch for your changes.
- Submit a pull request with a detailed description of your changes.
## License
This project is licensed under the MIT License.
## Project Maintainer
Ruby Poddar
Email: rubypoddarr@gmail.com