ChaM3Leon is a modular and scalable framework designed to support machine learning applications, with an emphasis on transparency, interoperability, and usability. It implements a custom lambda architecture, plus additional components designed to address the limitations of the Speed-Batch coupling in data ingestion and processing.
Because chaM3Leon is a framework, its layers are abstractions that you need to implement. To implement your own version of any abstract layer:
- Build the project by running the following command from the directory that contains the chaM3Leon pom.xml:
mvn clean install
- Generate a Maven project and add chaM3Leon as a dependency in your Maven pom.xml:
<dependency>
  <groupId>com.smartshaped</groupId>
  <artifactId>chameleon</artifactId>
  <version>0.0.1</version>
</dependency>
- Add the maven-shade-plugin to generate a shaded jar, in order to submit your layer implementation as a Spark application (keep in mind the framework is based on Java 11):
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.6.0</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <filters>
              <filter>
                <artifact>*:*</artifact>
                <excludes>
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                  <exclude>META-INF/*.RSA</exclude>
                </excludes>
              </filter>
            </filters>
            <transformers>
              <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                <resource>META-INF/services/org.apache.spark.sql.sources.DataSourceRegister</resource>
              </transformer>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
After this process has been completed, you can choose to extend any of the following layers:
To develop a batch application using the Batch Layer:
- Make sure that the class constructor is public.
- Declare this class in the YAML file along with the Kafka topic configuration (batch.kafka.topics.<topic_name>.class).
- Override the preprocess method to add custom preprocessing for the incoming streaming data.
- You can define a Preprocessor for each of the declared Kafka topics.
3. (OPTIONAL, only if you want to export custom metrics) Create a class that extends com.smartshaped.chameleon.batch.BatchUpdater:
- Make sure that the class constructor is public.
- Declare this class in the YAML file (batch.updater.class).
- Override the updateBatch method to implement the specific logic (working on a Spark DataFrame).
- Results will be saved to Cassandra automatically.
- Define the table fields as class attributes.
- Specify the name of the primary key as a string.
- Create a typeMapping.yml file to define the mapping between Java field types and CQL (Cassandra Query Language) types.
- Declare this class in the YAML file (batch.cassandra.model.class).
- Call the start method of BatchLayer inside the main method.
- Specify this class in the spark-submit command.
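The steps above repeatedly ask for a public constructor and for the class name to be declared in the YAML file, which suggests that the framework creates these classes by reflection. The sketch below illustrates that mechanism with plain Java only. It is not chaM3Leon code: the Updater interface, the MyUpdater and HiddenUpdater classes, and the instantiate helper are hypothetical stand-ins, and the use of a no-argument constructor is an assumption.

```java
import java.lang.reflect.Constructor;

public class ReflectiveLoading {

    /** Hypothetical stand-in for a framework abstraction (not a chaM3Leon type). */
    public interface Updater {
        String describe();
    }

    /** Hypothetical user class with a public no-argument constructor. */
    public static class MyUpdater implements Updater {
        public MyUpdater() { }

        @Override
        public String describe() {
            return "custom updater";
        }
    }

    /** Hypothetical user class whose constructor is not public. */
    public static class HiddenUpdater implements Updater {
        private HiddenUpdater() { }

        @Override
        public String describe() {
            return "hidden updater";
        }
    }

    /**
     * Creates an instance from a class name, as a framework might do with a
     * name read from a configuration file (assumed behaviour).
     */
    public static Updater instantiate(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            // getConstructor() only returns public constructors; a private or
            // package-private one results in NoSuchMethodException.
            Constructor<?> constructor = clazz.getConstructor();
            return (Updater) constructor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiate(MyUpdater.class.getName()).describe());
        try {
            instantiate(HiddenUpdater.class.getName());
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getCause());
        }
    }
}
```

In this sketch the class with the private constructor is rejected at lookup time, which is the kind of failure a non-public constructor would cause under the reflection assumption.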
To develop a streaming application using the Speed Layer:
- Make sure that the class constructor is public.
- This class allows you to export partial analyses/statistics from your time-windowed streaming data.
- Declare this class in the YAML file (speed.updater.class).
- Override the updateSpeed method to implement the specific logic (working on a Spark DataFrame).
- Results will be saved to Cassandra automatically.
- Define the table fields as class attributes.
- Specify the name of the primary key as a string.
- Create a typeMapping.yml file to define the mapping between Java field types and CQL (Cassandra Query Language) types.
- Declare this class in the YAML file (speed.cassandra.model.class).
- Call the start method of SpeedLayer inside the main method.
- Specify this class in the spark-submit command.
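The table model steps (fields as class attributes, a primary key name, and a Java-to-CQL type mapping) can be illustrated with a small self-contained sketch. It is only a guess at how such a model could be turned into a table definition, not the framework's implementation: the WordCount class, the typeMapping and createTable helpers, and the mapping entries are all hypothetical, and the real typeMapping.yml format may differ.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

public class TableModelSketch {

    /** Hypothetical table model: each class attribute becomes a column. */
    public static class WordCount {
        private String word;
        private long count;
        private double frequency;
    }

    /** Assumed mapping from Java type names to CQL types. */
    public static Map<String, String> typeMapping() {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("String", "text");
        mapping.put("int", "int");
        mapping.put("long", "bigint");
        mapping.put("double", "double");
        mapping.put("boolean", "boolean");
        return mapping;
    }

    /** Builds a CQL CREATE TABLE statement from the model's declared fields. */
    public static String createTable(Class<?> model, String table,
                                     String primaryKey, Map<String, String> mapping) {
        // Sort columns by name: reflection does not guarantee field order.
        Map<String, String> columns = new TreeMap<>();
        for (Field field : model.getDeclaredFields()) {
            if (field.isSynthetic()) {
                continue; // skip compiler- or tool-generated fields
            }
            String javaType = field.getType().getSimpleName();
            String cqlType = mapping.get(javaType);
            if (cqlType == null) {
                throw new IllegalArgumentException("No CQL type mapped for " + javaType);
            }
            columns.put(field.getName(), cqlType);
        }
        if (!columns.containsKey(primaryKey)) {
            throw new IllegalArgumentException("Unknown primary key: " + primaryKey);
        }
        StringJoiner joined = new StringJoiner(", ");
        columns.forEach((name, type) -> joined.add(name + " " + type));
        return "CREATE TABLE IF NOT EXISTS " + table
                + " (" + joined + ", PRIMARY KEY (" + primaryKey + "))";
    }

    public static void main(String[] args) {
        System.out.println(createTable(WordCount.class, "word_count", "word", typeMapping()));
    }
}
```

Here the primary key is passed as a plain string, mirroring the "name of the primary key as a string" step; a field type with no entry in the mapping fails fast instead of producing an invalid statement.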
To develop a machine learning application using the ML Layer:
- Make sure that the class constructor is public.
- Declare this class in the YAML file, along with the HDFS path from which the data will be read.
- Optionally, override the processRawData method to add custom processing for the raw data.
- Declare this class in the YAML file.
- Override the start method to implement the specific machine learning logic.
- Make sure that the setModel and setPredictions methods are called at the end of the pipeline.
- Make sure that the class constructor is public.
- Declare this class in the YAML file.
- Define the table fields as class attributes.
- Specify the name of the primary key as a string.
- Create a typeMapping.yml file to define the mapping between Java field types and CQL (Cassandra Query Language) types.
- Declare this class in the YAML file.
- Call the start method of MLLayer inside the main method.
- Specify this class in the spark-submit command.
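The requirement to call setModel and setPredictions at the end of the pipeline can be pictured with the following self-contained sketch. The method names start, setModel, and setPredictions come from the steps above, but everything else is an assumption: the Pipeline base class shown here is a hypothetical stand-in, the real signatures are not documented in this section, and plain Java values replace the Spark types that the framework presumably uses.

```java
import java.util.Arrays;

public class PipelineSketch {

    /** Hypothetical stand-in for the framework's abstract pipeline. */
    public abstract static class Pipeline {
        private Object model;
        private Object predictions;

        /** Subclasses put their machine learning logic here. */
        public abstract void start();

        protected void setModel(Object model) {
            this.model = model;
        }

        protected void setPredictions(Object predictions) {
            this.predictions = predictions;
        }

        public Object getModel() {
            return model;
        }

        public Object getPredictions() {
            return predictions;
        }
    }

    /** Toy pipeline: the "model" is the mean, which is predicted for every input. */
    public static class MeanPipeline extends Pipeline {
        private final double[] data;

        public MeanPipeline(double[] data) {
            this.data = data.clone();
        }

        @Override
        public void start() {
            double sum = 0.0;
            for (double value : data) {
                sum += value;
            }
            double mean = data.length == 0 ? 0.0 : sum / data.length;
            double[] predicted = new double[data.length];
            Arrays.fill(predicted, mean);
            // Publish both results as the last step of the pipeline.
            setModel(mean);
            setPredictions(predicted);
        }
    }

    /** Runs a pipeline and fails if either result was not published (assumed contract). */
    public static void run(Pipeline pipeline) {
        pipeline.start();
        if (pipeline.getModel() == null || pipeline.getPredictions() == null) {
            throw new IllegalStateException("setModel and setPredictions must both be called");
        }
    }

    public static void main(String[] args) {
        MeanPipeline pipeline = new MeanPipeline(new double[] {1.0, 2.0, 3.0});
        run(pipeline);
        System.out.println("model = " + pipeline.getModel());
    }
}
```

If the surrounding layer reads the model and predictions after start returns, as assumed in run, a pipeline that skips either setter leaves nothing to save.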
To generate the .jar file, run the following command from your project directory:
mvn clean install
Then, follow the Docker documentation