
Unstructured Text Mining & Pattern Analysis

NLP and anomaly detection on unstructured user-generated reports · 2024-11-17

ufo-report-scraper

Mining Insights from Unstructured Text Data

This project applies NLP and machine learning techniques to a large corpus of unstructured user-generated reports from NUFORC (the National UFO Reporting Center). It also shows adaptability: when the source website was redesigned, the project evolved from a web scraping tool into an ML experimentation platform.

Original Mission: Data Collection

The initial goal was straightforward:

  • Crawl NUFORC’s database of UFO sighting reports
  • Extract structured data (date, location, description, shape, duration)
  • Build a comprehensive dataset for analysis
  • Enable quantitative study of sighting patterns
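The extraction step can be pictured as mapping each scraped report onto a fixed record type. The sketch below is a minimal illustration, not the project's actual code: `SightingReport`, `parse_duration`, and the duration phrasings it recognizes are all assumptions, with the fields taken from the list above.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


# Hypothetical record layout; the fields mirror the list above,
# not NUFORC's actual schema.
@dataclass
class SightingReport:
    occurred_at: datetime
    location: str
    shape: str
    duration_seconds: Optional[float]
    description: str


# Assumed unit spellings; real reports are likely far messier.
_UNIT_SECONDS = {"second": 1, "sec": 1, "minute": 60, "min": 60, "hour": 3600, "hr": 3600}
_DURATION_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(second|sec|minute|min|hour|hr)s?\b", re.IGNORECASE
)


def parse_duration(text: str) -> Optional[float]:
    """Convert a free-form duration such as '5 minutes' to seconds.

    Returns None when no recognizable number-and-unit pair is found.
    """
    match = _DURATION_RE.search(text)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit.lower()]
```

Returning `None` for unparseable durations, rather than guessing, keeps the noisy cases visible for later analysis.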

NUFORC maintains one of the largest publicly available collections of UFO reports, making it a valuable data source for anyone interested in studying patterns in anomalous aerial phenomena reporting.

The Pivot: Website Redesign

When NUFORC redesigned its site, the original scraper broke. Rather than keep chasing DOM structure changes, the project pivoted to what makes the data interesting: applying ML and AI techniques to understand it.

New Direction: ML Applications

The dataset (collected before the redesign) now serves as a foundation for exploring:

Natural Language Processing

  • Topic modeling on sighting descriptions
  • Sentiment analysis of witness accounts
  • Named entity recognition for locations and objects
  • Text classification by sighting characteristics
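Most of the NLP tasks above start by turning free-form descriptions into weighted term vectors. As a rough sketch of that first step, here is a standard-library-only TF-IDF; the function names are illustrative, and a real pipeline would more likely rely on an established NLP library than on hand-rolled code like this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency),
    so a term that appears in every document gets weight 0.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Number of documents each term appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty descriptions
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

With a handful of invented descriptions that all contain "bright", that word is weighted 0 everywhere, while rarer words such as a shape name stand out. That is the behavior wanted before feeding vectors to a topic model or classifier.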

Pattern Recognition

  • Temporal analysis (time of day, seasonal patterns)
  • Geospatial clustering (hotspots, regional differences)
  • Shape and duration correlations
  • Witness behavior patterns in reporting
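The temporal and geospatial items above can be approximated with plain counting before reaching for a clustering algorithm. The following sketch assumes each report already has a parsed timestamp and a latitude/longitude pair; the function names, the one-degree cell size, and the sample data are all made up for illustration.

```python
import math
from collections import Counter
from datetime import datetime


def temporal_counts(timestamps):
    """Count reports per hour of day and per calendar month."""
    by_hour = Counter(ts.hour for ts in timestamps)
    by_month = Counter(ts.month for ts in timestamps)
    return by_hour, by_month


def grid_hotspots(coords, cell_degrees=1.0):
    """Count reports per latitude/longitude grid cell, busiest cell first.

    This is coarse binning, a stand-in for real clustering: cells are
    identified by the floor of each coordinate divided by the cell size.
    """
    cells = Counter(
        (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))
        for lat, lon in coords
    )
    return cells.most_common()


# Invented sample data, for illustration only.
sample_times = [
    datetime.fromisoformat(s)
    for s in ("2019-07-04T21:30:00", "2019-07-04T22:10:00", "2020-01-15T21:05:00")
]
sample_coords = [(47.6, -122.3), (47.7, -122.4), (33.4, -112.0)]
```

Grid binning ignores the fact that a degree of longitude shrinks toward the poles, so for a global dataset it is only a first look; a distance-based method would be a better fit for serious hotspot analysis.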

Anomaly Detection

  • Identifying unusual reports that deviate from typical patterns
  • Statistical analysis of outliers
  • Comparison with known aircraft flight paths
  • Cross-referencing with weather data
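One simple way to flag statistical outliers is a modified z-score built on the median and the median absolute deviation (MAD), which a few extreme reports distort less than the mean and standard deviation. This is a sketch under assumptions: the function name is illustrative, the 3.5 cutoff is a commonly used rule of thumb, and applying it to reported durations is just one possible choice.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold.

    Modified z-score = 0.6745 * |x - median| / MAD, where MAD is the
    median of absolute deviations from the median.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]
```

For invented durations in seconds such as `[30, 45, 60, 50, 40, 55, 7200]`, only the last index is flagged: a two-hour report among sightings lasting under a minute.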

Why This Dataset is Interesting for ML

Beyond the subject matter, this corpus of user-generated reports presents unique challenges:

  • Unstructured Text: Free-form descriptions require robust NLP
  • Noisy Data: Reports vary wildly in quality and detail
  • Class Imbalance: Different sighting types occur at different frequencies
  • Temporal Patterns: Long time-series data spanning decades
  • Geospatial Complexity: Global distribution with clustering effects
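Class imbalance in particular has a simple first-line mitigation: weighting each class inversely to its frequency, so that rare sighting types are not drowned out during training. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class_count); the function name and the shape labels in the usage note are invented for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }
```

For ten invented labels split 6 "light", 3 "triangle", and 1 "cigar", the rarest class gets the largest weight (10/3) and the most common the smallest (10/18).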

Broader Applications

Techniques developed for analyzing this dataset apply to:

  • Anomaly detection in user-generated content
  • Pattern recognition in witness testimony
  • Geospatial-temporal analysis of events
  • Classification of unstructured reports

The Meta-Lesson

This project illustrates an important principle in data science: the value isn’t always in collecting new data, but in applying modern analytical techniques to existing datasets. When the scraping tool broke, the interesting problem shifted from “how do we get the data?” to “what can we learn from the data we have?”

Sometimes the best pivot is from engineering to science.

Current Status

The repository serves as an experimental ground for ML techniques applied to unusual datasets. It’s a reminder that interesting data science projects can emerge from unexpected sources, and that adaptability is as important as initial planning.