.. _tutorials:

Tutorials
=========

Extracting a single feature
---------------------------

To extract a single feature, you will first need to import the ``Extractor`` class from the ``elfen.extractor`` module and initialize it with your data. The ``Extractor`` class will automatically preprocess your data upon initialization.

.. code-block:: python

    import polars as pl
    from elfen.extractor import Extractor

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Initialize the Extractor with your DataFrame
    # This will automatically load the spaCy model
    # and preprocess the text column
    # Assumes the text column is named "text"
    extractor = Extractor(data = df)

To extract a single feature, you will need to use the ``extract`` method and pass the ``feature`` parameter with the desired feature.

.. code-block:: python

    # Extract a single feature
    # In this case, we are extracting the "avg_word_length" feature
    extractor.extract("avg_word_length")

    print(extractor.data.head())

Extracting a single feature with additional parameters
------------------------------------------------------

For some features, you may want to pass additional parameters or resources. For example, you may have a custom sentiment lexicon that you want to use for the emotional valence features. Additionally, your lexicon may have ratings collected on a different scale, and you may thus want to adapt what constitutes a high-valence word.
You can pass such additional parameters, for example custom lexicons and thresholds, to the ``extract`` method.

.. code-block:: python
    
    import polars as pl

    from elfen.extractor import Extractor

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Initialize the Extractor with your DataFrame
    extractor = Extractor(data = df)

    # Custom lexicon
    custom_lexicon = pl.read_csv("path/to/your/custom_lexicon.csv")

    # Extract a single feature with additional parameters
    # We are passing a custom lexicon and a threshold
    # Assuming the words in the lexicon are in the "word" column
    # and the valence ratings are in the "valence" column
    extractor.extract("n_low_valence", lexicon = custom_lexicon, threshold = 0.5)

    print(extractor.data.head())
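
The comments above assume a lexicon with a ``word`` column and a ``valence`` column. As a minimal sketch of that layout, the following writes a toy lexicon to a temporary CSV file using only the standard library. The words, ratings, file location, and the 0 to 1 rating scale are made up for illustration.

.. code-block:: python

    import csv
    import tempfile
    from pathlib import Path

    # Toy lexicon rows: (word, valence rating); values are illustrative only
    rows = [("happy", 0.9), ("table", 0.5), ("awful", 0.1)]

    # Hypothetical file location inside a temporary directory
    lexicon_path = Path(tempfile.mkdtemp()) / "custom_lexicon.csv"

    with lexicon_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "valence"])  # column names assumed above
        writer.writerows(rows)

    # Read the file back to confirm the layout
    with lexicon_path.open(newline="", encoding="utf-8") as f:
        lexicon = list(csv.DictReader(f))

    print(lexicon[0])

A file with this layout can then be loaded with ``pl.read_csv`` as in the example above.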

Extracting multiple specific features
-------------------------------------

You can extract multiple specific features at once by passing a list of features to the ``extract`` method.

.. code-block:: python

    import polars as pl
    from elfen.extractor import Extractor

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Initialize the Extractor with your DataFrame
    # This will automatically load the spaCy model
    # and preprocess the text column
    # Assumes the text column is named "text"
    extractor = Extractor(data = df)

    # Extract multiple specific features
    # In this case, we are extracting the "avg_word_length" and "n_low_valence" features
    extractor.extract(features = ["avg_word_length", "n_low_valence"])

    print(extractor.data.head())

Unfortunately, at the moment you cannot pass additional parameters to the features when extracting multiple features at once.

Extracting feature areas
------------------------

Instead of extracting features one by one, or all at once, it is possible to extract features in groups, or areas. This is useful when you want to extract features that are related to each other, or when you only want to analyze certain types of features.

Similar to the feature extraction showcased in :ref:`quickstart`, you can extract features using the ``Extractor`` class.
To do this, you will first need to import the ``Extractor`` class from the ``elfen.extractor`` module and initialize it to preprocess your data.

.. code-block:: python

    import polars as pl
    from elfen.extractor import Extractor

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Initialize the Extractor with your DataFrame
    # This will automatically load the spaCy model
    # and preprocess the text column
    # Assumes the text column is named "text"
    extractor = Extractor(data = df)

Once you have initialized the ``Extractor`` class, you can extract features in groups. To do this, use the ``extract_feature_group`` method and pass the desired feature area as the ``feature_group`` parameter.

.. code-block:: python

    # Extract features in groups
    # This will extract all implemented features for the specified feature area
    # In this case, we are extracting features from the "lexical_richness" area
    extractor.extract_feature_group(feature_group = "lexical_richness")

    print(extractor.data.head())

You can also extract features from multiple feature areas at once by passing a list of feature areas to the ``feature_group`` parameter.

.. code-block:: python

    # Extract features in groups
    # This will extract all implemented features for the specified feature areas
    # In this case, we are extracting features from the "lexical_richness" and "readability" areas
    extractor.extract_feature_group(feature_group = ["lexical_richness", "readability"])

    print(extractor.data.head())

For more information on the available feature areas, check the :ref:`feature_overview` section.

Normalizing extracted features
-------------------------------

We provide the possibility to normalize extracted features in four different ways:

- ``normalize``: Normalizes the extracted features such that they have a mean of 0 and a standard deviation of 1.
- ``token_normalize``: Token-normalizes occurrence-based features (e.g. ``n_low_valence``) by dividing the feature value by the number of tokens in the text.
- ``ratio_normalize``: Normalizes the extracted features using a specific ratio (e.g. given features divided by the number of tokens), adding a new column for the result (e.g. ``n_low_valence_token_ratio``).
- ``rescale``: Rescales the extracted features using the min-max scaling method.

Normalize
~~~~~~~~~

.. code-block:: python

    import polars as pl
    from elfen.extractor import Extractor

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Initialize the Extractor with your DataFrame
    extractor = Extractor(data = df)

    # Extract features
    extractor.extract_feature_group(feature_group = "lexical_richness")
    extractor.extract("avg_word_length")
    extractor.extract("n_low_valence")

    # Normalize extracted features
    extractor.normalize("all") # Normalizes all extracted features
    extractor.normalize("avg_word_length") # Normalizes specific feature
    extractor.normalize(["avg_word_length", "n_low_valence"]) # Normalizes multiple specific features

    print(extractor.data.head())
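
To make the arithmetic concrete, here is a minimal plain-Python sketch of standardizing a feature to a mean of 0 and a standard deviation of 1. It is not elfen's implementation: the helper ``z_score`` is hypothetical, the feature values are made up, and the use of the sample standard deviation is an assumption.

.. code-block:: python

    from statistics import mean, stdev

    def z_score(values):
        """Standardize values to mean 0 and standard deviation 1."""
        mu = mean(values)
        sigma = stdev(values)  # sample standard deviation; an assumption
        return [(v - mu) / sigma for v in values]

    # Made-up average word lengths for three texts
    avg_word_length = [4.0, 5.0, 6.0]
    normalized = z_score(avg_word_length)

    print(normalized)

Note that this sketch fails with a division by zero if all values are identical, since the standard deviation is then 0.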

Token Normalize
~~~~~~~~~~~~~~~~

Whenever you extract features that are based on the occurrence of words in a text, such as the number of low-valence words, it is often useful to normalize these features by the number of tokens in the text. This is especially useful when comparing texts of different lengths.

.. code-block:: python

    import polars as pl
    from elfen.extractor import Extractor

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Initialize the Extractor with your DataFrame
    extractor = Extractor(data = df)

    # Extract features
    extractor.extract("n_long_words")
    extractor.extract("n_low_valence")

    # Token normalize extracted features
    # "all" token-normalizes every extracted feature starting with "n_",
    # except "n_tokens", "n_types", "n_sentences", "n_lemmas", and "n_syllables"
    extractor.token_normalize("all")
    # OR
    extractor.token_normalize("n_low_valence") # Token normalizes specific feature
    # OR
    extractor.token_normalize(["n_long_words", "n_low_valence"]) # Token normalizes multiple specific features

    print(extractor.data.head())

Ratio Normalize
~~~~~~~~~~~~~~~

Ratio-normalization allows you to normalize extracted features using a specific ratio, such as the number of tokens or the number of sentences. This is useful when you want to compare features that are not directly comparable, such as the number of low-valence words and the average word length.
While the functionality is otherwise the same as for token-normalization, ratio-normalization creates a new column with the suffix of the used ratio (e.g. ``_token_ratio``) instead of overwriting the original feature column.
This allows for the comparison of the original and the normalized feature.

.. code-block:: python

    import polars as pl
    from elfen.extractor import Extractor

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Initialize the Extractor with your DataFrame
    extractor = Extractor(data = df)

    # Extract features
    extractor.extract_feature_group(feature_group = "lexical_richness")
    extractor.extract("avg_word_length")
    extractor.extract("n_low_valence")

    # Ratio normalize extracted features
    extractor.ratio_normalize("all", "token") # Ratio normalizes all extracted features
    extractor.ratio_normalize("avg_word_length", "token") # Ratio normalizes specific feature
    extractor.ratio_normalize(["avg_word_length", "n_low_valence"], "token") # Ratio normalizes multiple specific features

    print(extractor.data.head())
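
The computation behind a token ratio can be sketched in plain Python as follows. This only illustrates the arithmetic described above and is not elfen's implementation; the rows and their values are made up, while the column names follow the examples in this section.

.. code-block:: python

    # Two made-up texts with the same raw count but different lengths
    rows = [
        {"n_tokens": 10, "n_low_valence": 2},
        {"n_tokens": 40, "n_low_valence": 2},
    ]

    for row in rows:
        # Keep the raw count and store the ratio in a new column
        row["n_low_valence_token_ratio"] = row["n_low_valence"] / row["n_tokens"]

    for row in rows:
        print(row)

Both texts contain two low-valence words, but the ratio shows that they make up a much larger share of the shorter text.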

Rescale
~~~~~~~

.. code-block:: python

    import polars as pl
    from elfen.extractor import Extractor

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Initialize the Extractor with your DataFrame
    extractor = Extractor(data = df)

    # Extract features
    extractor.extract_feature_group(feature_group = "lexical_richness")
    extractor.extract("avg_word_length")
    extractor.extract("n_low_valence")

    # Rescale extracted features to a range of 0 to 1
    extractor.rescale("all") # Rescales all extracted features
    extractor.rescale("avg_word_length") # Rescales specific feature
    extractor.rescale(["avg_word_length", "n_low_valence"]) # Rescales multiple specific features

    # Rescale extracted features to a custom range
    extractor.rescale("all", minimum = 0, maximum = 10) # Rescales all extracted features to a range of 0 to 10
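
Min-max scaling maps the smallest value of a feature to the lower bound and the largest value to the upper bound of the target range. The following plain-Python sketch illustrates this; the helper ``min_max_rescale`` is hypothetical and not part of elfen, and the feature values are made up.

.. code-block:: python

    def min_max_rescale(values, minimum=0.0, maximum=1.0):
        """Linearly map values onto the range [minimum, maximum]."""
        lowest, highest = min(values), max(values)
        span = highest - lowest  # assumes the values are not all identical
        return [
            minimum + (v - lowest) / span * (maximum - minimum)
            for v in values
        ]

    values = [2.0, 4.0, 6.0]  # made-up feature values

    unit_range = min_max_rescale(values)
    custom_range = min_max_rescale(values, minimum=0, maximum=10)

    print(unit_range)
    print(custom_range)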


Specifying the model, language, text column, maximum length, and the used resources
-----------------------------------------------------------------------------------
By default, the ``Extractor`` class uses the spaCy backbone with the ``en_core_web_sm`` model, reads the text from the column ``text``, and uses a maximum length of 100,000 tokens for feature extraction. You can change the model, language, text column, and maximum length by passing the respective parameters to the ``Extractor`` class.

.. code-block:: python

    import polars as pl
    from elfen.extractor import Extractor

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Initialize the Extractor with your DataFrame
    # This will automatically load the specified model
    # and preprocess the text column
    # Assumes the text column is named "comment"
    extractor = Extractor(data = df,
                          language = "de",
                          model = "de_dep_news_trf",
                          text_column = "comment",
                          max_length = 10000)

    # Extract features
    extractor.extract_features()

    print(extractor.data.head())

Extracting features using a custom configuration
------------------------------------------------

If you want to use a specific model (from either spaCy or Stanza), a specific language, or a specific set of features, you can use a custom configuration.

To do so, pass a dictionary with the desired configuration to the ``Extractor`` class via the ``config`` parameter, then call the ``extract_features`` method.

For example, the following configuration uses the spaCy backbone with the German model ``de_dep_news_trf`` and a maximum length of 10,000. It extracts only the average word length from the surface features, and the numbers of low-valence and high-valence words from the emotion features.

.. code-block:: python

    import polars as pl
    from elfen.extractor import Extractor

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Custom configuration
    custom_config = {
        "backbone": "spacy", 
        "language": "de",
        "model": "de_dep_news_trf",
        "max_length": 10000,
        "features": {
            "surface": ["avg_word_length"],
            "emotion": ["n_low_valence", "n_high_valence"]
        }
    }

    # Initialize the Extractor with your DataFrame and configuration
    extractor = Extractor(data = df, config = custom_config)

    # Extract features using a custom configuration
    extractor.extract_features()

    print(extractor.data.head())
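
Because the configuration is a plain dictionary, you can store it next to your results for reproducibility. The following is a minimal sketch using only the standard library; the file name ``elfen_config.json`` is hypothetical, and this does not imply that elfen itself reads configuration files.

.. code-block:: python

    import json
    import tempfile
    from pathlib import Path

    # Same configuration as in the example above
    custom_config = {
        "backbone": "spacy",
        "language": "de",
        "model": "de_dep_news_trf",
        "max_length": 10000,
        "features": {
            "surface": ["avg_word_length"],
            "emotion": ["n_low_valence", "n_high_valence"],
        },
    }

    # Hypothetical file location inside a temporary directory
    config_path = Path(tempfile.mkdtemp()) / "elfen_config.json"
    config_path.write_text(json.dumps(custom_config, indent=2), encoding="utf-8")

    # Load the configuration again, e.g. in a later session
    loaded_config = json.loads(config_path.read_text(encoding="utf-8"))

    print(loaded_config == custom_config)

The loaded dictionary can then be passed as ``config`` when initializing the ``Extractor``, as in the example above.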

For a full overview of the available parameters in the custom configuration, check the :ref:`custom_configuration` section.

Extracting custom lexicon-based features
----------------------------------------

If you want to extract features based on a custom lexicon that do not fit the predefined feature areas or the way these process a specific feature, you can use template functions for five kinds of custom lexicon-based features:

- ``get_n_custom``: Number of words in a text that are in a custom lexicon
- ``get_occurs_custom``: Whether or not a text contains a word from a custom lexicon
- ``get_n_custom_high``: Number of words in a text that are in a custom lexicon and have a rating above a certain threshold (the ratings are given in another column of the lexicon)
- ``get_n_custom_low``: Number of words in a text that are in a custom lexicon and have a rating below a certain threshold
- ``get_avg_custom``: Average rating of words in a text that are in a custom lexicon

To extract these custom lexicon-based features, you will need to load the respective custom lexicon as a polars DataFrame and extract the features as shown below.
Note that currently, there is no possibility to pass custom lexicon-based features in a custom configuration, so you will have to extract these features separately using the respective template functions.

.. code-block:: python

    import polars as pl

    from elfen.extractor import Extractor
    from elfen.custom import (
        get_n_custom,
        get_occurs_custom,
        get_n_custom_low,
        get_n_custom_high,
        get_avg_custom
    )

    # Load your custom lexicon as a polars DataFrame
    custom_lexicon = pl.read_csv("path/to/your/custom_lexicon.csv")

    # Load your dataset as a polars DataFrame
    # example from csv
    df = pl.read_csv("path/to/your/dataset.csv")

    # Initialize the Extractor with your DataFrame; 
    # preprocessing will be done automatically
    extractor = Extractor(data = df)

    # Get the preprocessed DataFrame from the Extractor
    df = extractor.data

    # Number of words in a text that are in a custom lexicon
    df = get_n_custom(data=df,  # DataFrame with text data
                      lexicon=custom_lexicon,  # DataFrame with custom lexicon
                      feature_name="n_custom",  # Name of the feature-column after extraction
                      word_column="word",  # Name of the column in the lexicon with the words
                      measurement_level="tokens")  # Measurement level of the feature; either "tokens" or "lemmas"
    
    # Whether or not a text contains a word from a custom lexicon
    df = get_occurs_custom(data=df,  # DataFrame with text data
                           lexicon=custom_lexicon,  # DataFrame with custom lexicon
                           feature_name="occurs_custom",  # Name of the feature-column after extraction
                           word_column="word",  # Name of the column in the lexicon with the words
                           measurement_level="tokens")  # Measurement level of the feature; either "tokens" or "lemmas"

    # Number of words in a text that are in a custom lexicon and have a rating above a certain threshold
    df = get_n_custom_high(data=df,  # DataFrame with text data
                           lexicon=custom_lexicon,  # DataFrame with custom lexicon
                           threshold=0.5,  # Threshold for the rating
                           feature_name="n_custom_high",  # Name of the feature-column after extraction
                           word_column="word",  # Name of the column in the lexicon with the words
                           feature_column="rating",  # Name of the column in the lexicon with the ratings
                           measurement_level="tokens")  # Measurement level of the feature; either "tokens" or "lemmas"

    # Number of words in a text that are in a custom lexicon and have a rating below a certain threshold
    df = get_n_custom_low(data=df,  # DataFrame with text data
                          lexicon=custom_lexicon,  # DataFrame with custom lexicon
                          threshold=0.5,  # Threshold for the rating
                          feature_name="n_custom_low",  # Name of the feature-column after extraction
                          word_column="word",  # Name of the column in the lexicon with the words
                          feature_column="rating",  # Name of the column in the lexicon with the ratings
                          measurement_level="tokens")  # Measurement level of the feature; either "tokens" or "lemmas"

    # Average rating of words in a text that are in a custom lexicon
    df = get_avg_custom(data=df,  # DataFrame with text data
                        lexicon=custom_lexicon,  # DataFrame with custom lexicon
                        feature_name="avg_custom",  # Name of the feature-column after extraction
                        word_column="word",  # Name of the column in the lexicon with the words
                        feature_column="rating",  # Name of the column in the lexicon with the ratings
                        measurement_level="tokens")  # Measurement level of the feature; either "tokens" or "lemmas"

    print(df.head())
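
To clarify what the five template features measure, here is a plain-Python sketch for a single tokenized text. It does not call elfen and may differ from its implementation in details: the strict threshold comparisons, the case-sensitive matching, and the counting of repeated words are assumptions, and the lexicon and tokens are made up.

.. code-block:: python

    # Toy lexicon mapping words to ratings; values are made up
    lexicon = {"happy": 0.9, "fine": 0.6, "awful": 0.1}

    # One tokenized text
    tokens = ["what", "an", "awful", "but", "happy", "day"]
    threshold = 0.5

    # Ratings of all tokens that appear in the lexicon
    ratings = [lexicon[token] for token in tokens if token in lexicon]

    n_custom = len(ratings)                              # cf. get_n_custom
    occurs_custom = n_custom > 0                         # cf. get_occurs_custom
    n_custom_high = sum(r > threshold for r in ratings)  # cf. get_n_custom_high
    n_custom_low = sum(r < threshold for r in ratings)   # cf. get_n_custom_low
    # cf. get_avg_custom; None if no token is in the lexicon
    avg_custom = sum(ratings) / n_custom if n_custom else None

    print(n_custom, occurs_custom, n_custom_high, n_custom_low, avg_custom)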

Limiting the number of cores used
---------------------------------
The underlying dataframe library, polars, uses all available cores by default.
If you are working on a shared server, you may want to consider limiting the resources available to polars.
To do that, set the ``POLARS_MAX_THREADS`` environment variable in your shell before starting Python, e.g.:

.. code-block:: shell

    # Limit the number of threads to 8
    export POLARS_MAX_THREADS=8
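
If you cannot change the shell environment, for example in a notebook, you can set the variable from within Python instead. Polars reads it when the library is first imported, so it has to be set before importing polars or elfen. The helper ``limit_polars_threads`` below is a hypothetical convenience function, not part of polars or elfen.

.. code-block:: python

    import os
    import sys

    def limit_polars_threads(n_threads: int) -> bool:
        """Set POLARS_MAX_THREADS.

        Returns True if polars has not been imported yet, i.e. if the
        limit can still take effect in this interpreter session.
        """
        os.environ["POLARS_MAX_THREADS"] = str(n_threads)
        return "polars" not in sys.modules

    # Limit the number of threads to 8
    effective = limit_polars_threads(8)
    if not effective:
        print("polars is already imported; restart Python to apply the limit")

    # Import polars and elfen only after this point, e.g.:
    # import polars as pl
    # from elfen.extractor import Extractor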

.. note::
    If none of the template functions or other feature extraction functions fits your needs and you implement your own, please consider contributing to the package by opening a pull request on the `GitHub repository`_.

.. _GitHub repository: https://www.github.com/mmmaurer/elfen