Tutorials

Extracting a single feature

To extract a single feature, you will first need to import the Extractor class from the elfen.extractor module and initialize it with your data. The Extractor class will automatically preprocess your data upon initialization.

import polars as pl
from elfen.extractor import Extractor

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Initialize the Extractor with your DataFrame
# This will automatically load the spaCy model
# and preprocess the text column
# Assumes the text column is named "text"
extractor = Extractor(data = df)

To extract a single feature, you will need to use the extract method and pass the feature parameter with the desired feature.

# Extract a single feature
# In this case, we are extracting the "average_word_length" feature
extractor.extract("avg_word_length")

print(extractor.data.head())

Extracting a single feature with additional parameters

For some features, you may want to pass additional parameters or resources. For example, you may have a custom sentiment lexicon that you want to use for the emotional valence features. Additionally, your lexicon may have ratings collected on a different scale, and you may thus want to adapt what constitutes a high-valence word. You can pass these additional parameters such as custom lexicons and thresholds to the extract method.

import polars as pl

from elfen.extractor import Extractor

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Initialize the Extractor with your DataFrame
extractor = Extractor(data = df)

# Custom lexicon
custom_lexicon = pl.read_csv("path/to/your/custom_lexicon.csv")

# Extract a single feature with additional parameters
# We are passing a custom lexicon and a threshold
# Assuming the words in the lexicon are in the "word" column
# and the valence ratings are in the "valence" column
extractor.extract("n_low_valence", lexicon = custom_lexicon, threshold = 0.5)

print(extractor.data.head())

Extracting multiple specific features

You can extract multiple specific features at once by passing a list of features to the extract method.

import polars as pl
from elfen.extractor import Extractor

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Initialize the Extractor with your DataFrame
# This will automatically load the spaCy model
# and preprocess the text column
# Assumes the text column is named "text"
extractor = Extractor(data = df)

# Extract multiple specific features
# In this case, we are extracting the "avg_word_length" and "n_low_valence" features
extractor.extract(features = ["avg_word_length", "n_low_valence"])

print(extractor.data.head())

Unfortunately, at the moment you cannot pass additional parameters to the features when extracting multiple features at once.

Extracting feature areas

Instead of extracting features one by one, or all at once, it is possible to extract features in groups, or areas. This is useful when you want to extract features that are related to each other, or when you only want to analyze certain types of features.

Similar to the feature extraction showcased in Quickstart, you can extract features using the Extractor class. To do this, you will first need to import the Extractor class from the elfen.extractor module and Initialize it to preprocess your data.

import polars as pl
from elfen.extractor import Extractor

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Initialize the Extractor with your DataFrame
# This will automatically load the spaCy model
# and preprocess the text column
# Assumes the text column is named "text"
extractor = Extractor(data = df)

Given that you have initialized the Extractor class, you can now extract features in groups. To do this, you will need to use the extract_feature_group method and pass the feature_group parameter with the desired feature area.

# Extract features in groups
# This will extract all implemented features for the specified feature area
# In this case, we are extracting features from the "lexical_richness" area
extractor.extract_feature_group(feature_group = "lexical_richness")

print(extractor.data.head())

Alternatively, you can also extract features from multiple feature areas at once. To do this, you will need to pass a list of feature areas to the feature_group parameter.

# Extract features in groups
# This will extract all implemented features for the specified feature areas
# In this case, we are extracting features from the "lexical_richness" and "readability" areas
extractor.extract_feature_group(feature_group = ["lexical_richness", "readability"])

print(extractor.data.head())

For more information on the available feature areas, check the Feature Overview section.

Normalizing extracted features

We provide the possibility to normalize extracted features in four different ways:

  • normalize: Normalizes the extracted features such that they have a mean of 0 and a standard deviation of 1

  • token_normalize: Token-normalize occurence-based features (e.g. n_low_valence) by dividing the feature value by the number of tokens in the text.

  • ratio_normalize: Normalizes the extracted features using a specific ratio (e.g. given features divided by the number of tokens) with a new column being added (e.g. n_low_valence_token_ratio)

  • rescale: Rescales the extracted features using the min-max scaling method

Normalize

import polars as pl
from elfen.extractor import Extractor

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Initialize the Extractor with your DataFrame
extractor = Extractor(data = df)

# Extract features
extractor.extract_feature_group(feature_group = "lexical_richness")
extractor.extract("avg_word_length")
extractor.extract("n_low_valence")

# Normalize extracted features
extractor.normalize("all") # Normalizes all extracted features
extractor.normalize("avg_word_length") # Normalizes specific feature
extractor.normalize(["avg_word_length", "n_low_valence"]) # Normalizes multiple specific features

print(extractor.data.head())

Token Normalize

Whenever you extract features that are based on the occurrence of words in a text, such as the number of low-valence words, it is often useful to normalize these features by the number of tokens in the text. This is especially useful when comparing texts of different lengths.

import polars as pl
from elfen.extractor import Extractor

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Initialize the Extractor with your DataFrame
extractor = Extractor(data = df)

# Extract features
extractor.extract("n_long_words")
extractor.extract("n_low_valence")

# Token normalize extracted features
extractor.token_normalize("all") # Token normalizes all extracted features starting with "n_" except for "n_tokens", "n_types" and "n_sentences", "n_lemmas" and "n_syllables"
# OR
extractor.token_normalize("n_low_valence") # Token normalizes specific feature
# OR
extractor.token_normalize(["n_long_words", "n_low_valence"]) # Token normalizes multiple specific features

print(extractor.data.head())

Ratio Normalize

Ratio-normalization allows you to normalize extracted features using a specific ratio, such as the number of tokens or the number of sentences. This is useful when you want to compare features that are not directly comparable, such as the number of low-valence words and the average word length. While the functionality is exactly the same for token-normalization, ratio-normalization will create a new column with the suffix of the used ratio (e.g. _token_ratio) instead of overwriting the original feature column. This allows for the comparison of the original and the normalized feature.

import polars as pl
from elfen.extractor import Extractor

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Initialize the Extractor with your DataFrame
extractor = Extractor(data = df)

# Extract features
extractor.extract_feature_group(feature_group = "lexical_richness")
extractor.extract("avg_word_length")
extractor.extract("n_low_valence")

# Ratio normalize extracted features
extractor.ratio_normalize("all", "token") # Ratio normalizes all extracted features
extractor.ratio_normalize("avg_word_length", "token") # Ratio normalizes specific feature
extractor.ratio_normalize(["avg_word_length", "n_low_valence"], "token") # Ratio normalizes multiple specific features

print(extractor.data.head())

Rescale

import polars as pl
from elfen.extractor import Extractor

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Initialize the Extractor with your DataFrame
extractor = Extractor(data = df)

# Extract features
extractor.extract_feature_group(feature_group = "lexical_richness")
extractor.extract("avg_word_length")
extractor.extract("n_low_valence")

# Rescale extracted features to a range of 0 to 1
extractor.rescale("all") # Rescales all extracted features
extractor.rescale("avg_word_length") # Rescales specific feature
extractor.rescale(["avg_word_length", "n_low_valence"]) # Rescales multiple specific features

# Rescale extracted features to a custom range
extractor.rescale("all", minimum = 0, maximum = 10) # Rescales all extracted features to a range of 0 to 10

Specifying the model, language, text column, maximum length, and the used resources

By default, the Extractor class uses the spaCy backbone and the en_core_web_sm model, the column text, and a maximum length of 100,000 tokens for feature extraction. However, you can specify the model, language, text column, and maximum length of the text to process by passing the respective parameters to the Extractor class.

import polars as pl
from elfen.extractor import Extractor

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Initialize the Extractor with your DataFrame
# This will automatically load the specified model
# and preprocess the text column
# Assumes the text column is named "comment"
extractor = Extractor(data = df,
                      language = "de",
                      model = "de_dep_news_trf",
                      text_column = "comment",
                      max_length = 10000)

# Extract features
extractor.extract_features()

print(extractor.data.head())

Extracting features using a custom configuration

In cases where you want to extract features using a specific model (either from spacy or stanza), in a specific language, or you have a specific set of features you want to extract, you can use a custom configuration.

To extract features using a custom configuration, you will need to pass a dictionary with the desired configuration to the extract method.

For example, you can extract features using the spacy backbone, in German, using the model de_dep_news_trf, with a maximum length of 10,000 and only extract the average word length from the surface features and the number of low-valence words and high-valence words from the emotion features.

import polars as pl
from elfen.extractor import Extractor

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Custom configuration
custom_config = {
    "backbone": "spacy",
    "language": "de",
    "model": "de_dep_news_trf",
    "max_length": 10000,
    "features": {
        "surface": ["avg_word_length"],
        "emotion": ["n_low_valence", "n_high_valence"]
    }
}

# Initialize the Extractor with your DataFrame and configuration
extractor = Extractor(data = df, config = custom_config)

# Extract features using a custom configuration
extractor.extract_features()

print(extractor.data.head())

For a full overview over available parameters in the custom configuration, check the Custom configuration section.

Extracting custom lexicon-based features

In cases where you want to extract features based on a custom lexicon that do not fit into the predefined feature areas or way of processing the specific feature, we provide the possibility to extract custom lexicon-based features using some custom template functions for five potential templated features of interest:

  • get_n_custom: Number of words in a text that are in a custom lexicon

  • get_occurs_custom: Whether or not a text contains a word from a custom lexicon

  • get_n_custom_high: The number of words in a text that are in a custom lexicon and have a rating above a certain threshold (given in another column of the lexicon)

  • get_n_custom_low: The number of words in a text that are in a custom lexicon and have a rating below a certain threshold.

  • get_avg_custom: The average rating of words in a text that are in a custom lexicon

To extract these custom lexicon-based features, you will need to load the respective custom lexicon as a polars DataFrame and extract the features as shown below. Note that currently, there is no possibility to pass custom lexicon-based features in a custom configuration, so you will have to extract these features separately using the respective template functions.

import polars as pl

from elfen.extractor import Extractor
from elfen.custom import (
    get_n_custom,
    get_occurs_custom,
    get_n_custom_low,
    get_n_custom_high,
    get_avg_custom
)

# Load your custom lexicon as a polars DataFrame
custom_lexicon = pl.read_csv("path/to/your/custom_lexicon.csv")

# Load your dataset as a polars DataFrame
# example from csv
df = pl.read_csv("path/to/your/dataset.csv")

# Initialize the Extractor with your DataFrame;
# preprocessing will be done automatically
extractor = Extractor(data = df)

# Load your custom lexicon as a polars DataFrame
df = extractor.data

# Load your custom lexicon as a polars DataFrame
custom_lexicon = pl.read_csv("path/to/your/custom_lexicon.csv")

# Number of words in a text that are in a custom lexicon
df = get_n_custom(data=df,  # DataFrame with text data
                  lexicon=custom_lexicon,  # DataFrame with custom lexicon
                  feature_name="n_custom",  # Name of the feature-column after extraction
                  word_column="word",  # Name of the column in the lexicon with the words
                  measurement_level="tokens")  # Measurement level of the feature; either "tokens" or "lemmas"

# Whether or not a text contains a word from a custom lexicon
df = get_occurs_custom(data=df,  # DataFrame with text data
                       lexicon=custom_lexicon,  # DataFrame with custom lexicon
                       feature_name="occurs_custom",  # Name of the feature-column after extraction
                       word_column="word",  # Name of the column in the lexicon with the words
                       measurement_level="tokens")  # Measurement level of the feature; either "tokens" or "lemmas"

# Number of words in a text that are in a custom lexicon and have a rating above a certain threshold
df = get_n_custom_high(data=df,  # DataFrame with text data
                       lexicon=custom_lexicon,  # DataFrame with custom lexicon
                       threshold=0.5,  # Threshold for the rating
                       feature_name="n_custom_high",  # Name of the feature-column after extraction
                       word_column="word",  # Name of the column in the lexicon with the words
                       feature_column="rating",  # Name of the column in the lexicon with the ratings
                       measurement_level="tokens")  # Measurement level of the feature; either "tokens" or "lemmas"

# Number of words in a text that are in a custom lexicon and have a rating below a certain threshold
df = get_n_custom_low(data=df,  # DataFrame with text data
                      lexicon=custom_lexicon,  # DataFrame with custom lexicon
                      threshold=0.5,  # Threshold for the rating
                      feature_name="n_custom_low",  # Name of the feature-column after extraction
                      word_column="word",  # Name of the column in the lexicon with the words
                      feature_column="rating",  # Name of the column in the lexicon with the ratings
                      measurement_level="tokens")  # Measurement level of the feature; either "tokens" or "lemmas"

# Average rating of words in a text that are in a custom lexicon
df = get_avg_custom(data=df,  # DataFrame with text data
                    lexicon=custom_lexicon,  # DataFrame with custom lexicon
                    feature_name="avg_custom",  # Name of the feature-column after extraction
                    word_column="word",  # Name of the column in the lexicon with the words
                    feature_column="rating",  # Name of the column in the lexicon with the ratings
                    measurement_level="tokens")  # Measurement level of the feature; either "tokens" or "lemmas"

print(df.head())

Limiting the numbers of cores used

The underlying dataframe library, polars, uses all available cores by default. If you are working on a shared server, you may want to consider limiting the resources available to polars. To do that, you will have to set the POLARS_MAX_THREADS variable in your shell, e.g.:

# Limit the number of threads to 8
export POLARS_MAX_THREADS=8

Note

If you do not find a suitable template function or different feature extraction function, and you implement your own, please consider contributing to the package by opening a pull request on the GitHub repository.