99 diachronic features #100

Merged 7 commits on Jul 22, 2024
3 changes: 2 additions & 1 deletion .github/workflows/deploy.yml
@@ -25,7 +25,8 @@ jobs:

- name: Publish
run: |
poetry publish -u ${{ secrets.PYPI_UNAME }} -p ${{ secrets.PYPI_PWD }}
poetry config pypi-token.pypi ${{ secrets.PYPI_TOKEN }}
poetry publish
- name: Upload binaries to release
uses: softprops/action-gh-release@v1
if: ${{startsWith(github.ref, 'refs/tags/') }}
13 changes: 6 additions & 7 deletions README.rst
@@ -1,7 +1,7 @@
=====
SINr
=====
|languages| |downloads| |license| |version| |cpython| |wheel| |python| |docs| |activity| |contributors| |quality| |build|
|languages| |downloads| |license| |version| |cpython| |wheel| |python| |activity| |contributors|

*SINr* is an open-source tool to efficiently compute graph and word
embeddings. Its aim is to provide sparse interpretable vectors from a
@@ -50,7 +50,8 @@ Usage example
=============

To get started using *SINr* to build graph and word embeddings, have a
look at the `notebook <./notebooks>`__ directory.
look at the `notebook <https://github.com/SINr-Embeddings/sinr/tree/main/notebooks>`_
directory.

Here is a minimum working example of *SINr*

@@ -132,7 +133,7 @@ to discuss the changes to be made.
License
=======

Released under `CeCILL 2.1 <https://cecill.info/>`__, see `LICENSE <./LICENSE>`__ for more details.
Released under `CeCILL 2.1 <https://cecill.info/>`__, see `LICENSE <https://github.com/SINr-Embeddings/sinr/blob/main/LICENSE>`__ for more details.

Publications
============
@@ -141,7 +142,7 @@ Publications
find *SINr* useful for your own research, please cite the appropriate
papers from the list below. Publications can also be found on
`publications page in the
documentation <https://sinr-embeddings.github.io/sinr/_build/html/publications.html>`__.
documentation <https://sinr-embeddings.github.io/sinr/publications.html>`__.

**Initial SINr paper, 2021**

@@ -184,8 +185,6 @@ documentation <https://sinr-embeddings.github.io/sinr/_build/html/publications.h
.. |cpython| image:: https://img.shields.io/pypi/implementation/sinr
.. |wheel| image:: https://img.shields.io/pypi/wheel/sinr
.. |python| image:: https://img.shields.io/pypi/pyversions/sinr
.. |docs| image:: https://img.shields.io/website?url=https%3A%2F%2Fsinr-embeddings.github.io%2Fsinr%2F_build%2Fhtml%2Findex.html
.. |activity| image:: https://img.shields.io/github/commit-activity/y/SINr-Embeddings/sinr
.. |contributors| image:: https://img.shields.io/github/contributors/SINr-Embeddings/sinr
.. |quality| image:: https://scrutinizer-ci.com/g/SINr-Embeddings/sinr/badges/quality-score.png?b=main
.. |build| image:: https://scrutinizer-ci.com/g/SINr-Embeddings/sinr/badges/build.png?b=main

4 changes: 2 additions & 2 deletions docs/source/conf.py
@@ -7,8 +7,8 @@
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'SINr'
copyright = '2023, Thibault Prouteau, Nicolas Dugué, Simon Guillot'
author = 'Thibault Prouteau, Nicolas Dugué, Simon Guillot'
copyright = '2024, Thibault Prouteau, Nicolas Dugué, Simon Guillot, Anna Béranger'
author = 'Thibault Prouteau, Nicolas Dugué, Simon Guillot, Anna Béranger'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
137 changes: 90 additions & 47 deletions docs/source/presentation.rst
@@ -4,7 +4,7 @@
Overview
============

|languages| |downloads| |license| |version| |cpython| |wheel| |python| |docs| |activity| |contributors| |quality| |build|
|languages| |downloads| |license| |version| |cpython| |wheel| |python| |activity| |contributors|

*SINr* is an open-source tool to efficiently compute graph and word
embeddings. Its aim is to provide sparse interpretable vectors from a
@@ -38,24 +38,12 @@ Requirements
Install
-------

**SINr** can be installed through ``pip`` or from source using ``poetry`` directives.
**SINr** can be installed through ``pip``.

.. tabs::
.. code:: bash

.. code-tab:: zsh pip

#Activate conda environment
conda activate sinr
pip install sinr

.. code-tab:: zsh from source

#Activate conda environment
conda activate sinr
git clone git@github.com:SINr-Embeddings/sinr.git
cd sinr
pip install poetry #poetry solves dependencies and installs SINr
poetry install #Installs SINr based on the pyproject.toml file
conda activate sinr # activate conda environment
pip install sinr


Usage example
@@ -69,30 +57,66 @@ Here is a minimum working example of SINr :

.. code:: python

import urllib
import io
import gzip
import networkit as nk
import sinr.graph_embeddings as ge


url = "https://snap.stanford.edu/data/wiki-Vote.txt.gz"
graph_file = "wikipedia-votes.txt"
# Read a graph from SNAP
sock = urllib.request.urlopen(url) # open URL
s = io.BytesIO(sock.read()) # read into BytesIO "file"
sock.close()
with gzip.open(s, "rt") as f_in:
with open(graph_file, "wt") as f_out:
f_out.writelines(f_in.readlines())
# Initialize a networkit.Graph object from SNAP graph
G = nk.readGraph(graph_file, nk.Format.SNAP)

# Build a SINr model and extract embeddings
model = ge.SINr.load_from_graph(G)
model.run(algo=nk.community.PLM(G))
embeddings = model.get_nr()
print(embeddings)
import nltk # For textual resources

import sinr.text.preprocess as ppcs
from sinr.text.cooccurrence import Cooccurrence
from sinr.text.pmi import pmi_filter
import sinr.graph_embeddings as ge
import sinr.text.evaluate as ev

# Get a textual corpus
# For example, texts from the Project Gutenberg electronic text archive,
# hosted at http://www.gutenberg.org/
nltk.download('gutenberg')
gutenberg = nltk.corpus.gutenberg # NLTK's small selection of Project Gutenberg texts
file = open("my_corpus.txt", "w")
file.write(gutenberg.raw())
file.close()

# Preprocess corpus
vrt_maker = ppcs.VRTMaker(ppcs.Corpus(ppcs.Corpus.REGISTER_WEB,
ppcs.Corpus.LANGUAGE_EN,
"my_corpus.txt"),
".", n_jobs=8)
vrt_maker.do_txt_to_vrt()
sentences = ppcs.extract_text("my_corpus.vrt", min_freq=20)

# Construct cooccurrence matrix
c = Cooccurrence()
c.fit(sentences, window=5)
c.matrix = pmi_filter(c.matrix)
c.save("my_cooc_matrix.pk")

# Train SINr model
model = ge.SINr.load_from_cooc_pkl("my_cooc_matrix.pk")
commu = model.detect_communities(gamma=10)
model.extract_embeddings(commu)

# Construct SINrVectors to manipulate the model
sinr_vec = ge.InterpretableWordsModelBuilder(model,
'my_sinr_vectors',
n_jobs=8,
n_neighbors=25).build()
sinr_vec.save()

# Sparsify vectors for better interpretability and performance
sinr_vec.sparsify(100)

# Evaluate the model with the similarity task
print('\nResults of the similarity evaluation :')
print(ev.similarity_MEN_WS353_SCWS(sinr_vec))

# Explore word vectors and dimensions of the model
print("\nDimensions activated by the word 'apple' :")
print(sinr_vec.get_obj_stereotypes('apple', topk_dim=5, topk_val=3))

print("\nWords similar to 'apple' :")
print(sinr_vec.most_similar('apple'))

# Load an existing SinrVectors object
sinr_vec = ge.SINrVectors('my_sinr_vectors')
sinr_vec.load()
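
To make the co-occurrence and PMI steps of the example above more concrete, here is a minimal, dependency-free sketch of the general idea. It is not SINr's implementation: the helper names ``cooccurrence_counts`` and ``pmi_filter_sketch`` are hypothetical, and the filtering rule (keep only pairs whose pointwise mutual information is strictly positive) is an assumption about what a PMI filter typically does.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count unordered word pairs appearing within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, left in enumerate(tokens):
            # Look only to the right so each pair is counted once per occurrence.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


def pmi_filter_sketch(counts):
    """Keep only pairs with strictly positive PMI (hypothetical filtering rule)."""
    pair_total = sum(counts.values())
    # Row sums of the symmetric co-occurrence matrix implied by `counts`.
    marginals = Counter()
    for (a, b), n in counts.items():
        marginals[a] += n
        marginals[b] += n
    # Each unordered pair fills two symmetric cells, so the matrix total is doubled.
    matrix_total = 2 * pair_total
    kept = {}
    for (a, b), n in counts.items():
        # PMI = log( P(a, b) / (P(a) * P(b)) ), with probabilities taken from the matrix.
        pmi = math.log(n * matrix_total / (marginals[a] * marginals[b]))
        if pmi > 0:
            kept[(a, b)] = n
    return kept


# Toy corpus: "a"/"b" and "c"/"d" co-occur often, "a"/"c" only once.
toy_sentences = [["a", "b"]] * 3 + [["c", "d"]] * 3 + [["a", "c"]]
toy_counts = cooccurrence_counts(toy_sentences, window=5)
toy_filtered = pmi_filter_sketch(toy_counts)
```

On this toy corpus the rare pair ``("a", "c")`` gets a negative PMI and is dropped, while the two frequent pairs are kept. The actual ``Cooccurrence`` and ``pmi_filter`` objects work on a sparse matrix and may use a different normalisation or threshold.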


Contributing
@@ -115,13 +139,35 @@ Publications can also be found on :ref:`Publications`.

**Initial SINr paper, 2021**

- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez,
Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse
Interpretable Node Representations is not a Sin!. Advances in
Intelligent Data Analysis XIX, 19th International Symposium on
Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal.
pp.325-337,
⟨\ `10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`__\ ⟩.
`⟨hal-03197434⟩ <https://hal.science/hal-03197434>`__

**Interpretability of SINr embedding**

- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier.
Are Embedding Spaces Interpretable? Results of an Intrusion Detection
Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille,
France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`__

**Sparsity of SINr embedding**

- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!. Advances in Intelligent Data Analysis XIX, 19th International Symposium on Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal. pp.325-337, ⟨`10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`_⟩. `⟨hal-03197434⟩ <https://hal.science/hal-03197434>`_
- Simon Guillot, Thibault Prouteau, Nicolas Dugué.
Sparser is better: one step closer to word embedding interpretability.
IWCS 2023, Nancy, France.
`⟨hal-04321407⟩ <https://hal.science/hal-04321407>`__

**Interpretability of SINr embeddings, 2022**
**Filtering dimensions of SINr embedding**

- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier. Are Embedding Spaces Interpretable? Results of an Intrusion Detection Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille, France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`_
- Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau.
Filtering communities in word co-occurrence networks to foster the
emergence of meaning. Complex Networks 2023, Menton, France.
`⟨hal-04398742⟩ <https://hal.science/hal-04398742>`__

.. |languages| image:: https://img.shields.io/github/languages/count/SINr-Embeddings/sinr
.. |downloads| image:: https://img.shields.io/pypi/dm/sinr
@@ -130,8 +176,5 @@ Publications can also be found on :ref:`Publications`.
.. |cpython| image:: https://img.shields.io/pypi/implementation/sinr
.. |wheel| image:: https://img.shields.io/pypi/wheel/sinr
.. |python| image:: https://img.shields.io/pypi/pyversions/sinr
.. |docs| image:: https://img.shields.io/website?url=https%3A%2F%2Fsinr-embeddings.github.io%2Fsinr%2F_build%2Fhtml%2Findex.html
.. |activity| image:: https://img.shields.io/github/commit-activity/y/SINr-Embeddings/sinr
.. |contributors| image:: https://img.shields.io/github/contributors/SINr-Embeddings/sinr
.. |quality| image:: https://scrutinizer-ci.com/g/SINr-Embeddings/sinr/badges/quality-score.png?b=main
.. |build| image:: https://scrutinizer-ci.com/g/SINr-Embeddings/sinr/badges/build.png?b=main
16 changes: 13 additions & 3 deletions docs/source/publications.rst
@@ -6,10 +6,20 @@ Publications
**Initial SINr paper, 2021**


- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!. Advances in Intelligent Data Analysis XIX, 19th International Symposium on Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal. pp.325-337, ⟨`10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`_⟩. `⟨hal-03197434⟩ <https://hal.science/hal-03197434>`_
- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!. Advances in Intelligent Data Analysis XIX, 19th International Symposium on Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal. pp.325-337, ⟨\ `10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`__\ ⟩.
`⟨hal-03197434⟩ <https://hal.science/hal-03197434>`__

**Interpretability of SINr embedding**

**Interpretability of SINr embeddings, 2022**

- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier. Are Embedding Spaces Interpretable? Results of an Intrusion Detection Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille, France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`__

- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier. Are Embedding Spaces Interpretable? Results of an Intrusion Detection Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille, France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`_
**Sparsity of SINr embedding**


- Simon Guillot, Thibault Prouteau, Nicolas Dugué. Sparser is better: one step closer to word embedding interpretability. IWCS 2023, Nancy, France. `⟨hal-04321407⟩ <https://hal.science/hal-04321407>`__

**Filtering dimensions of SINr embedding**


- Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau. Filtering communities in word co-occurrence networks to foster the emergence of meaning. Complex Networks 2023, Menton, France. `⟨hal-04398742v1⟩ <https://hal.science/hal-04398742v1>`__
8 changes: 8 additions & 0 deletions docs/source/sinr.text.rst
@@ -33,6 +33,14 @@ Preprocess Text
:members:
:undoc-members:
:show-inheritance:

Evaluate
---------------------------

.. automodule:: sinr.text.evaluate
:members:
:undoc-members:
:show-inheritance:
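
As a rough illustration of what a word-similarity evaluation generally measures (not a description of how ``sinr.text.evaluate`` is written), the sketch below compares the cosine similarities produced by a model with human ratings using Spearman's rank correlation. All names here (``cosine``, ``ranks``, ``spearman``, ``similarity_score``) and the toy vectors and ratings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def similarity_score(vectors, rated_pairs):
    """Correlate model similarities with human ratings, skipping unknown words."""
    model, human = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            human.append(rating)
    return spearman(model, human)


# Toy data: two-dimensional vectors and made-up human ratings.
toy_vectors = {"cat": [1, 0], "dog": [0.9, 0.1], "car": [0, 1], "bus": [0.2, 0.8]}
toy_pairs = [("cat", "dog", 9.0), ("car", "bus", 8.0),
             ("cat", "car", 1.0), ("dog", "bus", 2.0)]
toy_score = similarity_score(toy_vectors, toy_pairs)
```

A score close to 1 means the model orders word pairs the same way the annotators do. The sketch does not handle the degenerate case where all similarities or all ratings are identical, which would make the correlation undefined.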

Module contents
---------------