Datasheet EIE

This datasheet covers both the prediction tasks we introduce and the underlying EIE data sources.

Motivation

  • For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled?

    Creating this dataset involved:

    • Collecting raw data about students' performance on the standardized tests EIE (External Independent Evaluation) and NMT (National Multi-subject Test), published by the Ukrainian Center for Educational Quality Assessment (UCEQA), a budgetary institution under the management of the Ministry of Education and Science, and about governmental expenditures on education, published by the Ministry of Finance;
    • Cleaning, matching, and processing this raw data by our research team.

    UCEQA releases the corresponding raw data each year for two reasons:

    1. They are required by the Public Information Law to publish public information on their website:
      "Public information in the form of open data is public information in a format that allows for its automated processing by electronic means, open and free access to it, and its further use.
      Information administrators must provide public information in the form of open data upon request, publish it, and regularly update it on the unified governmental open data web portal and their websites."
    2. They provide this information for anyone interested in conducting independent research on the results of educational assessments.

    The Ministry of Finance released the corresponding raw data for school statistics to provide public information about expenditures on education, as required by the Public Information Law.

  • Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

    The processed dataset was created from available raw data on UCEQA and the Ministry of Finance portals by Dr. Julia Stoyanovich, Andrew Bell, and Falaah Arif Khan from the Center for Responsible AI, New York University, and Dr. Tetiana Zakharchenko, Nazarii Drushchak, and Oleksandra Konopatska from Ukrainian Catholic University.

  • Who funded the creation of the dataset?

    The collection of raw data was funded by the Ukrainian government. The cleaning, matching, and pre-processing of the raw data collected were funded by the Center for Responsible AI, New York University.

Composition

  • What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)?

    The datasets contain information about participants in the EIE (External Independent Evaluation) from 2016 to 2021 and in the NMT (National Multi-subject Test) from 2022 to 2023. The test participants are those who want to use the test results to obtain the Certificate of Complete General Secondary Education or to gain admission to a higher education institution for a bachelor’s degree.

    There are multiple types of instances from different datasets:

    • Locations: each instance represents a geographical location.
    • Schools: each instance represents an organization.
    • School Statistics: each instance represents an organization.
    • Students: each instance represents an individual.
    • Students take tests: each instance represents an individual.
    • Test centers: each instance represents an organization.
    • Tests: each instance represents a subject.
    • Year: each instance represents a year.
  • How many instances are there in total (of each type, if appropriate)?

    The following table describes the sizes of the datasets:

    | Table | Features | Datapoints |
    |---|---|---|
    | Locations | 5 | 31,862 |
    | Schools | 4 | 82,513 |
    | School Statistics | 12 | 17,636 |
    | Students | 10 | 2,490,052 |
    | Students take tests | 10 | 10,597,976 |
    | Test centers | 4 | 3,358 |
    | Tests | 2 | 25 |
    | Year | 1 | 8 |
  • Does the dataset contain all possible instances, or is it a sample (not necessarily random) of instances from a larger set?

    The dataset contains all possible instances of students who registered for participation in EIE or NMT in the corresponding year. Some of the collected features (name, date of birth, etc.) are hidden in accordance with the Law of Ukraine "On Personal Data Protection."

  • What data does each instance consist of?

    Each instance consists of features. The Jupyter notebook Tables_Description describes each feature in the dataset.

  • Is there a label or target associated with each instance?

    The raw dataset was not released with a prediction task and therefore has no built-in label. However, we chose to create an associated prediction task and use the exam results as labels. There are several types of results: test status (whether an individual passed the exam), grade on the raw scale (which depends on the subject and year), grade on the 100-200 rating scale (used for university admission), and grade on a 12-point scale (used for final assessments at school).

  • Is any information missing from individual instances?

    Some features (e.g., language of the class or raw score in Students) contain missing values because, in some years, certain information was not collected or published. The chart below shows the years and datasets for which each feature is available ("+" means available, "-" means not available):

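    | Feature | Table name | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 |
    |---|---|---|---|---|---|---|---|---|---|
    | KATOTTG_2023 | Schools | + | + | + | + | + | + | + | + |
    | EDRPOU | Schools | + | + | + | + | + | + | + | + |
    | eotypename | Schools | + | + | + | + | + | + | + | + |
    | year | Schools | + | + | + | + | + | + | + | + |
    | EDRPOU | Schools_Stats | - | - | - | + | + | - | - | - |
    | eotype | Schools_Stats | - | - | - | + | + | - | - | - |
    | eolevel | Schools_Stats | - | - | - | + | + | - | - | - |
    | teachstuff | Schools_Stats | - | - | - | + | + | - | - | - |
    | nonteachstuff | Schools_Stats | - | - | - | + | + | - | - | - |
    | teachstuffretage | Schools_Stats | - | - | - | + | + | - | - | - |
    | pupils | Schools_Stats | - | - | - | + | + | - | - | - |
    | classes | Schools_Stats | - | - | - | + | + | - | - | - |
    | opex | Schools_Stats | - | - | - | + | + | - | - | - |
    | opexplan | Schools_Stats | - | - | - | + | + | - | - | - |
    | hub | Schools_Stats | - | - | - | + | + | - | - | - |
    | year | Schools_Stats | - | - | - | + | + | - | - | - |
    | outid | Students | + | + | + | + | + | + | + | + |
    | birth | Students | + | + | + | + | + | + | + | + |
    | sextypename | Students | + | + | + | + | + | + | + | + |
    | classprofilename | Students | - | + | + | + | + | + | - | - |
    | regtypename | Students | + | + | + | + | + | + | + | + |
    | classlangname | Students | - | + | + | + | + | + | - | - |
    | KATOTTG_2023 | Students | + | + | + | + | + | + | + | + |
    | EDRPOU_school | Students | + | + | + | + | + | + | + | + |
    | year | Students | + | + | + | + | + | + | + | + |
    | status | Students | + | + | + | + | + | + | + | + |
    | outid | Students_Take_Tests | + | + | + | + | + | + | + | + |
    | year | Students_Take_Tests | + | + | + | + | + | + | + | + |
    | score100 | Students_Take_Tests | + | + | + | + | + | + | + | + |
    | score12 | Students_Take_Tests | + | + | + | + | + | + | - | - |
    | score | Students_Take_Tests | - | - | + | + | + | + | + | + |
    | test_status | Students_Take_Tests | + | + | + | + | + | + | + | + |
    | test_subject | Students_Take_Tests | + | + | + | + | + | + | + | + |
    | test_type | Students_Take_Tests | + | + | + | + | + | + | + | + |
    | KATOTTG_2023_test_center | Students_Take_Tests | + | + | + | + | + | + | + | + |
    | EDRPOU_test_center | Students_Take_Tests | + | + | + | + | + | + | - | - |
    | KATOTTG_2023 | Test Centers | + | + | + | + | + | + | - | - |
    | year | Test Centers | + | + | + | + | + | + | - | - |
    | EDRPOU | Test Centers | + | + | + | + | + | + | - | - |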
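Because availability differs by feature, an analysis that combines several features is limited to the years in which all of them are published. The following is a minimal sketch of that check. The `FEATURE_YEARS` mapping and the `common_years` helper are hypothetical names introduced here for illustration; the year ranges are transcribed from the chart above for a handful of features only.

```python
# Years in which selected features are published, transcribed from the
# availability chart. FEATURE_YEARS and common_years are illustrative names,
# not part of the dataset's tooling.
FEATURE_YEARS = {
    ("Students", "classprofilename"): range(2017, 2022),    # 2017-2021
    ("Students", "classlangname"): range(2017, 2022),       # 2017-2021
    ("Students_Take_Tests", "score12"): range(2016, 2022),  # 2016-2021
    ("Students_Take_Tests", "score"): range(2018, 2024),    # 2018-2023
    ("Schools_Stats", "opex"): range(2019, 2021),           # 2019-2020
}
ALL_YEARS = range(2016, 2024)


def common_years(features):
    """Return the sorted years in which every requested feature is available.

    Assumption: a (table, feature) pair missing from FEATURE_YEARS is treated
    as available in all years, so extend the mapping before relying on this.
    """
    years = set(ALL_YEARS)
    for key in features:
        years &= set(FEATURE_YEARS.get(key, ALL_YEARS))
    return sorted(years)


years = common_years([
    ("Students", "classlangname"),
    ("Students_Take_Tests", "score"),
])
print(years)
```

For the two features requested here, the overlap is 2018 through 2021.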
  • Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)?

    Yes, they are explicit. To understand the relationships between instances, see the following ER Diagram:
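As an illustration of how such relationships might be used, the sketch below joins a toy Students table to a toy Students_Take_Tests table. It assumes that `outid` together with `year` links the two tables; this is an assumption based on both tables listing those columns, not something confirmed by the ER diagram. The rows are invented, and an in-memory SQLite database stands in for the PostgreSQL database used by the project.

```python
import sqlite3

# Toy schema using column names from the datasheet; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (outid TEXT, year INTEGER, sextypename TEXT);
CREATE TABLE students_take_tests (
    outid TEXT, year INTEGER, test_subject TEXT, score100 REAL
);
INSERT INTO students VALUES ('a1', 2021, 'female'), ('b2', 2021, 'male');
INSERT INTO students_take_tests VALUES
    ('a1', 2021, 'math', 150.0),
    ('a1', 2021, 'history', 162.5),
    ('b2', 2021, 'math', 140.0);
""")

# Assumed join key: (outid, year). One student row matches many test rows.
rows = conn.execute("""
    SELECT s.outid, s.sextypename, t.test_subject, t.score100
    FROM students AS s
    JOIN students_take_tests AS t
      ON t.outid = s.outid AND t.year = s.year
    ORDER BY s.outid, t.test_subject
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Joining on `year` as well as `outid` is a conservative choice in case identifiers are reused across years.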

  • Are there any errors, sources of noise, or redundancies in the dataset?

    The publicly available data from the Ukrainian Center for Educational Quality Assessment contains errors, so we had to clean and standardize it. The primary challenges in working with this data are the lack of consistent data entry practices across years (and sometimes even within the same year), the lack of a unified data format from year to year, and errors from manual data entry. Examples of issues include test-taking requirements that differ from year to year, typos in school names, and changes in the administrative divisions of Ukraine. Supplementary materials in the associated paper describe the whole cleaning process.

  • Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?

    The raw data was obtained from an external source: the open data portal of the Ukrainian Center for Educational Quality Assessment. In this repository, we provide Jupyter notebooks and scripts to generate the processed and cleaned datasets from the original source.

    1. Access to this dataset is guaranteed by the Ukrainian law on access to public information. Information administrators are obliged to provide public information in the form of open data upon request, and to publish and regularly update it on the unified state open data web portal and on their websites. These datasets are published via an open data portal.
    2. You can find the official archival versions of the datasets on the open data portal. These datasets are depersonalized (sensitive features are hidden) according to the Law of Ukraine "On Personal Data Protection."
    3. There are no restrictions for data usage, and the law of Ukraine guarantees it: “Public information in the form of open data is authorized for its further free use and distribution. Any person may freely copy, publish, distribute, use, including for commercial purposes, in combination with other information or by incorporating into their own product public information in the form of open data with a mandatory reference to the source of such information.”

    Article 101. Public information in the form of open data

    The datasets about participants and their EIE or NMT results are published from August to September of the same year, starting in 2016. Annual education expenditure datasets are published around January-February of the following year, starting in 2018, but in 2022 publication was stopped due to the war in Ukraine.

  • Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)?

    Personally Identifiable Information (PII) in these datasets is depersonalized and protected in accordance with the Law of Ukraine "On Personal Data Protection".

  • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?

    No.

  • Does the dataset relate to people?

    Yes, each instance in the Students and Students_Take_Tests datasets corresponds to a person.

  • Does the dataset identify any subpopulations (e.g., by age, gender)?

    The dataset contains identifying information for subpopulations, since each individual has features like year of birth, gender, living location (urban or rural), and type of school.

  • Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset?

    To the best of our knowledge, and in accordance with the Law of Ukraine "On Personal Data Protection," it is impossible to identify individuals directly from the datasets. However, the possibility of reconstruction attacks that combine data from the UCEQA with other data sources is a concern.

  • Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)?

    The datasets contain features, such as year of birth and gender, that are often considered sensitive.

Collection process

  • How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)?

    The Ukrainian Center for Educational Quality Assessment (UCEQA) collected the data. The participants provided some of the information during registration (year of birth, gender, school), and the corresponding departments of UCEQA verified it. UCEQA added information about the test centers and the results of EIE or NMT tests, as well as depersonalized data on all test participants. Results of EIE or NMT tests were obtained after the tests via defined evaluation systems. More details on the data are provided on the open data portal.

  • What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)?

    The dataset contains the following:

    • Demographic data on test participants and their educational institutions, collected during registration. For EIE, participants provided data on a paper registration form, which was then curated manually. For NMT, participants created a personal account on the website of the Ukrainian Center, entered their personal data and participation details into the ICT (Information and Communication System) of the UCEQA, and submitted the entered information and copies of documents to the regional center for processing.
    • EIE and NMT results, obtained from direct test results (scored by machine for EIE, which was paper-based, and by the corresponding software for NMT, which is computer-based) and, for EIE, from open (written) assignments scored by experts. Results were corrected via the appeal process and converted to the various grading scales.
    • Information about test centers was created by UCEQA during the organizational procedure.
  • If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

    The dataset is not a sample: it contains all possible instances of people who registered for participation in EIE or NMT in the corresponding year. Some of the collected features (name, date of birth, etc.) are hidden in accordance with the Law of Ukraine "On Personal Data Protection."

  • Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

    The staff of the EIE came from the Ministry of Education and Science of Ukraine, the Ukrainian Center for Educational Quality Assessment, regional centers for educational quality assessment, the executive authority of the Autonomous Republic of Crimea in the field of education, and structural units for education and science of regional, Kyiv, and Sevastopol city state administrations. The data was collected by employees of all these organizations. We have no information on which individuals were involved in the data collection process or how much they were paid.

    The subjects of the NMT were the Ministry of Education and Science of Ukraine, the Ukrainian Center for Educational Quality Assessment, regional centers for educational quality assessment, structural units for education and science of regional, Kyiv city state and military administrations, and higher education institutions. The data was collected through the ICT (Information and Communication System) of the UCEQA and the software where participants passed the test.

  • Over what timeframe was the data collected?

    Data was collected from 2016 to 2023.

  • Were any ethical review processes conducted (e.g., by an institutional review board)?

    As a government agency, the Ukrainian Center for Educational Quality Assessment is subject to government oversight mechanisms. For example, the testing process is in line with the provisions of the Convention for the Protection of Human Rights and Fundamental Freedoms and the case law of the European Court of Human Rights. The process was also approved by the All-Ukrainian Organization of Disabled People "Union of Organizations of Disabled People of Ukraine". Finally, the Ministry of Education and Science of Ukraine conducted an anti-discrimination examination.

  • Were the individuals in the dataset notified about the data collection?

    Yes. The Order of the Ministry of Education and Science of Ukraine states: “The fact of receiving the registration card at the processing point is the basis for processing personal data in the process of preparing and conducting external independent evaluation, their use during admission to higher education institutions in accordance with the requirements of the Law of Ukraine "On Personal Data Protection"”. The order is made publicly available on the website of the Ukrainian Center for Educational Quality Assessment each year.

  • Did the individuals in the dataset consent to the collection and use of their data?

    Participants consent by sending the registration documents to the Ukrainian Center for Educational Quality Assessment or by submitting the information through the Information and Communication System for the NMT.

  • If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses?

    No.

  • Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted?

    No.

Preprocessing/cleaning/labeling

  • Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?

    We used datasets from the open data portal of the Ukrainian Center for Educational Quality Assessment. A leading challenge in working with this open data across years is the lack of standard data entry practices (sometimes even within the same year) and of a unified format for the data released each year. To overcome this, we took several manual and automated data-cleaning steps. The data was then transformed into normalized tables and loaded into a PostgreSQL database. The main cleaning steps concerned location data, which is highly inconsistent year over year because of the significant decommunization and decentralization that took place in Ukraine between 2016 and 2023. Other important cleaning steps involved the names of educational institutions, where the raw data has many inconsistencies caused by repeated renaming of institutions and by manual data entry.

    Supplementary materials in the associated paper describe the whole cleaning process.
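To give a sense of the kind of inconsistency that manual entry of institution names produces, here is a minimal, hypothetical normalization sketch. It is not the cleaning procedure used for this dataset (that is described in the paper's supplementary materials); the function name `normalize_school_name` and the example school names are invented for illustration.

```python
import re
import unicodedata


def normalize_school_name(name: str) -> str:
    """Hypothetical normalization; NOT the procedure used for this dataset."""
    # Use a single canonical Unicode form for composed characters.
    text = unicodedata.normalize("NFC", name)
    # Map apostrophe variants to one character.
    text = text.replace("’", "'").replace("`", "'").replace("ʼ", "'")
    # Drop straight, angle, and curly quotation marks around names.
    text = re.sub(r"[\"«»“”„]", "", text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Case-insensitive comparison form.
    return text.casefold()


# Two invented spellings of the same (made-up) school name.
a = normalize_school_name("  Ліцей   №5 «Перспектива» ")
b = normalize_school_name('ліцей №5 "Перспектива"')
print(a == b)
```

Rules like these only remove superficial differences; renamed institutions still need a stable identifier or a manual mapping to be matched across years.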

  • Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)?

    Yes. The raw data from the Ukrainian Center for Educational Quality Assessment (UCEQA) can be found here: https://zno.testportal.com.ua/opendata. The raw data from the Ministry of Finance can be found here: https://mof.gov.ua/uk/the-reform-of-education.

  • Is the software used to preprocess/clean/label the instances available?

    The code used to clean the data is all open source and available on the GitHub page https://github.com/DataResponsibly/ZNO-Dataset.

Uses

  • Is there a repository that links to any or all papers or systems that use the dataset?

    All the coding and data work for the associated paper can be found in this GitHub repository.

  • Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?

    The dataset is clean and complete. The user should know that across years, the minimum score for passing the exam for each subject may be defined differently (please see the Appendix). One must be careful when comparing two years of data for one subject.
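One common way to make scores from different years more comparable is to rank each score within its own year and subject rather than compare raw values. The sketch below illustrates that idea on invented records; it is an assumption about how a user might proceed, not a method prescribed by the dataset authors. The field names follow the Students_Take_Tests columns, and the function name `within_year_percentiles` is hypothetical.

```python
from collections import defaultdict


def within_year_percentiles(records):
    """Map (year, subject, outid) to the share of same-year, same-subject
    scores that are less than or equal to that record's score."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["year"], r["test_subject"])].append(r["score100"])

    result = {}
    for r in records:
        scores = groups[(r["year"], r["test_subject"])]
        at_or_below = sum(1 for s in scores if s <= r["score100"])
        result[(r["year"], r["test_subject"], r["outid"])] = at_or_below / len(scores)
    return result


# Invented toy records; real data would need missing scores filtered out first.
records = [
    {"outid": "a", "year": 2020, "test_subject": "math", "score100": 150.0},
    {"outid": "b", "year": 2020, "test_subject": "math", "score100": 170.0},
    {"outid": "c", "year": 2021, "test_subject": "math", "score100": 150.0},
    {"outid": "d", "year": 2021, "test_subject": "math", "score100": 130.0},
]
pct = within_year_percentiles(records)
print(pct[(2020, "math", "a")], pct[(2021, "math", "c")])
```

In this toy data the same score of 150 sits at the middle of the 2020 group but at the top of the 2021 group, which is the kind of difference a raw cross-year comparison would hide.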

  • Are there tasks for which the dataset should not be used? If so, please provide a description.

    This dataset contains personal information, and users should not attempt to re-identify individuals in it.

Distribution

  • Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

    The dataset is available for download in this GitHub repository.

  • Is the dataset distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?

    The code in this repository is available under the MIT license. The EIE data itself is based on public files managed by the Ukrainian Center for Educational Quality Assessment (UCEQA). For more information, see https://zno.testportal.com.ua/opendata. There are no restrictions on data usage, as guaranteed by the law of Ukraine: “Public information in the form of open data is authorized for its further free use and distribution. Any person may freely copy, publish, distribute, use, including for commercial purposes, in combination with other information or by incorporating into their own product public information in the form of open data with a mandatory reference to the source of such information.” (Article 101. Public information in the form of open data)

  • Have any third parties imposed IP-based or other restrictions on the data associated with the instances?

    No.

  • Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?

    To our knowledge, no export controls or regulatory restrictions apply to the dataset.

Maintenance

  • Who is supporting/hosting/maintaining the dataset?

    The dataset will be hosted on GitHub and supported and maintained by Nazarii Drushchak, Tetiana Zakharchenko, Oleksandra Konopatska, Andrew Bell, and Julia Stoyanovich.

  • How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

    Please send issues and requests to [email protected].

  • Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?

    The dataset will be updated as required to address errors and refine the prediction problems based on feedback from the community. The package maintainers will update the dataset and communicate these updates on GitHub.

    Older versions of the datasets in ZNO_DATASET will be clearly indicated, supported, and maintained on the GitHub website. Each new version of the dataset will be tagged with version metadata and an associated GitHub release.

  • If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?

    Users wishing to contribute to ZNO_DATASET are encouraged to do so by submitting a pull request to this repository. The maintainers will review contributions. Accepted contributions will be reflected in the next version of the dataset and announced as part of the corresponding GitHub release.