Problem Statement

The problem is to design a system that takes a blog post or other text as input and predicts the gender of its author as accurately as possible. This is still an open research problem, since even human readers find it difficult to recognize an author's gender from a given blog. The intended use of the system is to help reduce online crime. Crime that exploits anonymity has increased in recent times, because people online may choose not to reveal their true identity. The recent rise in online fraud, which has even driven some victims to suicide, has become a challenge for cyber police. Our approach draws on probabilistic studies of gender-specific writing style.


We derived about 545 features. This feature set is divided into five main groups:

  1. Character-based features
  2. Word-based features
  3. Syntactic features
  4. Structural features
  5. Function words
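
As a rough illustration of the five groups above, the sketch below computes one or two example features per group. It is not the project's actual feature extractor: the feature names, the tiny FUNCTION_WORDS list, and the choice of which measure belongs to which group are all assumptions made for the example.

```python
import re
import string

# Hypothetical, very short function-word list; the project's real list is not given.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "i", "you")


def extract_features(text):
    """Return a dict with a few example features from each of the five groups."""
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    n_chars = len(text) or 1   # guard against division by zero on empty input
    n_words = len(words) or 1

    features = {
        # 1. Character-based features
        "char_count": len(text),
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        # 2. Word-based features
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "vocab_richness": len(set(lowered)) / n_words,
        # 3. Syntactic features (approximated here by punctuation usage)
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars,
        "question_marks": text.count("?"),
        # 4. Structural features
        "sentence_count": len(sentences),
        "paragraph_count": len(paragraphs),
    }
    # 5. Function words: relative frequency of each listed word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = lowered.count(fw) / n_words
    return features
```

Calling extract_features on each cleaned blog would give one feature dictionary per blog, and these can be stacked into a feature matrix for the classifiers.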

To ensure that all features are treated equally during classification, we rescaled every feature to the range 0 to 1 using min-max normalization:

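Normalized x_ij = (x_ij - min(x_j)) / (max(x_j) - min(x_j))

where x_ij is the value of feature j for blog i, and min(x_j) and max(x_j) are the smallest and largest values of feature j over all blogs.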
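
The normalization formula can be sketched in a few lines of plain Python. The function name min_max_normalize and the handling of constant columns (mapped to 0.0) are assumptions for this example; the original text does not say how a feature with max equal to min was treated.

```python
def min_max_normalize(rows):
    """Scale each column of `rows` (a list of equal-length lists) to [0, 1]."""
    if not rows:
        return []
    columns = list(zip(*rows))            # one tuple per feature j
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]

    normalized = []
    for row in rows:
        scaled = []
        for x, lo, hi in zip(row, mins, maxs):
            # Assumption: a constant feature (hi == lo) is mapped to 0.0
            # to avoid dividing by zero.
            scaled.append((x - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(scaled)
    return normalized
```

For example, the rows [1, 10], [3, 20], [5, 40] have first-column values 1, 3, 5, which become 0.0, 0.5, 1.0.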


The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers, gathered in August 2004. The corpus contains a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. Each blog is stored as a separate file whose name indicates the blogger's id number and self-provided gender, age, industry and astrological sign. (All blogs are labeled for gender and age, but for many the industry and/or sign is marked as unknown.) All bloggers in the corpus fall into one of three age groups: 8,240 "10s" blogs (ages 13-17), 8,086 "20s" blogs (ages 23-27) and 2,994 "30s" blogs (ages 33-47). For each age group there are an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped, with two exceptions: individual posts within a single blog are separated by the date of the following post, and links within a post are denoted by the label urllink.
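
Since the labels are carried in the file names, they can be read without opening the files. The sketch below assumes a dot-separated layout of the form id.gender.age.industry.sign.xml; the text above only says which fields the name contains, so the separator, the field order, and the example file name are assumptions. The age_group helper simply encodes the three age ranges listed above.

```python
from collections import namedtuple

BlogMeta = namedtuple("BlogMeta", "blogger_id gender age industry sign")


def parse_blog_filename(name):
    """Split a file name into its metadata fields.

    Assumed layout (not confirmed by the text):
    <id>.<gender>.<age>.<industry>.<sign>.xml
    Raises ValueError if the name does not have exactly five fields.
    """
    stem = name[:-4] if name.lower().endswith(".xml") else name
    blogger_id, gender, age, industry, sign = stem.split(".")
    return BlogMeta(blogger_id, gender, int(age), industry, sign)


def age_group(age):
    """Map an age to the corpus age group, or None if it falls in a gap."""
    if 13 <= age <= 17:
        return "10s"
    if 23 <= age <= 27:
        return "20s"
    if 33 <= age <= 47:
        return "30s"
    return None


# Hypothetical file name, used only to show the call.
meta = parse_blog_filename("12345.female.25.unknown.Leo.xml")
```

Here meta.gender is the class label used for training, and age_group(meta.age) gives "20s".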

Preprocessing and Feature Selection

The dataset was provided as XML. Processing it was challenging because the files contained unwanted tags and whitespace. We converted them into clean plain-text blogs with no unwanted characters. Feature selection was then carried out on the cleaned text, using the five feature groups listed above.
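
A minimal sketch of this cleaning step is shown below. It assumes that each post is wrapped in a <post> element and that dates appear in <date> elements; these tag names are assumptions, as is the choice of regular expressions over an XML parser (made here so that malformed markup does not abort processing). It is not the project's actual preprocessing code.

```python
import html
import re


def extract_posts(raw_xml):
    """Return the cleaned text of every non-empty post in one blog file."""
    # Assumed tag name: <post>...</post> around each post body.
    posts = re.findall(r"<post>(.*?)</post>", raw_xml,
                       flags=re.DOTALL | re.IGNORECASE)
    cleaned = []
    for post in posts:
        text = re.sub(r"<[^>]+>", " ", post)      # drop any leftover tags
        text = html.unescape(text)                # e.g. &amp; becomes &
        text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
        if text:                                  # skip posts that end up empty
            cleaned.append(text)
    return cleaned
```

The urllink label is left in place by this sketch, so that the number of links in a post can still be counted as a feature if desired.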


The classification algorithms, which classify blogs by author gender, were run on the features we computed (about 450 features).

Classifiers used and their accuracy in gender detection:

  1. Bayesian Networks 62%-65%
  2. Decision Trees 61%-63%
  3. SVM (Support Vector Machine) 65%-70%
  4. Random Forest 80%-82%
