Data Science And Machine Learning. With Java?
Common Applications of Data Science
The blogosphere is full of descriptions of how data science and "AI" are changing the world. In financial services, applications include personalized financial offers, fraud detection, risk assessment (e.g. for loans), portfolio analysis and trading strategies, but the same technologies are relevant elsewhere: customer churn in telecoms, personalized treatment in healthcare, predictive maintenance for manufacturers, and demand forecasting in retail.
These applications are largely not new, nor are "AI" algorithms such as neural networks. However, increasingly commoditized, flexible and cheaper hardware, together with readily available algorithms and APIs, has lowered the barriers to the data- and compute-intensive approaches common in data science, making the use of "AI" algorithms much more straightforward.
Key Definitions: Machine Learning, Data Science, etc
For practitioners, definitions are well understood. For those less familiar and curious, here are some quick definitions and introductions to baseline everyone.
At their heart, data science workflows transform data from heterogeneous sources, through models and learning, into information from which "useful" decisions can be expedited. Decisions may be automated (e.g. an online search or a retail credit fraud check) or may inform human ones (e.g. a portfolio manager's investment decisions or a complex corporate lending negotiation).
"Most cloud-native-type companies need five data engineers for each data scientist to get the data into the form and location needed for good data science," said Jason Preszler, head data scientist at Karat, a technical hiring service. "Without both roles, the data [that] companies are easily collecting is just sitting around or underutilized."
I've seen exceptional domain-specialists-turned-data-scientists also become CTO-like unicorns, bridging the gap between algorithm, implementation and business insight. I've also seen enterprise architects and CTOs, particularly those gifted with both soft skills and AI-focused STEM PhDs, drive algorithmic research, a chance perhaps to relive their university days. Their direction in turn helps specialists deliver individual tasks, from algorithms and research, data munging and warehousing, and software and application development right through to business-level reporting and, where applicable, automated activity execution.
Now let’s briefly examine some key algorithmic terminologies, important because we’ll return to them later in the article when exploring emerging Java capabilities:
Machine learning uses training data to predict future values or uncover structure, essentially learning from example. It can be supervised (training a model on known inputs and their labeled outputs) or unsupervised (finding hidden patterns or intrinsic structures in unlabeled input data).
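To make "learning from example" concrete, here is a minimal, self-contained sketch of supervised learning in plain Java: an ordinary least squares fit of a line to a handful of labeled (x, y) pairs. The data values are made up purely for illustration.

```java
// Toy supervised learning: fit y = a*x + b to labeled examples
// with ordinary least squares. Data values are illustrative only.
public class LeastSquaresDemo {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};           // inputs (features)
        double[] y = {2.1, 4.2, 5.9, 8.1, 9.8}; // labeled outputs
        int n = x.length;

        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i]; sumY += y[i];
            sumXY += x[i] * y[i]; sumXX += x[i] * x[i];
        }
        // Closed-form OLS estimates for slope and intercept
        double a = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double b = (sumY - a * sumX) / n;

        // "Predict" a future value from the learned model
        System.out.printf("y = %.3f * x + %.3f; prediction at x=6: %.3f%n",
                a, b, a * 6 + b);
    }
}
```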
In deep learning, a computer model learns to perform classification tasks directly from images, text, signals or sound.
Among various deep learning algorithms, I've interacted with two significantly in my financial services life:
Reinforcement Learning: This approach utilizes a human-like trial and error “agent-based” approach to reinforce paths that work and discard paths that don’t. Such approaches are popular in search, retail and trading strategies, as they can mimic complex human behavior. They are also applied in ADAS (Advanced-Driver Assistance Systems) applications, intersecting well with the human-machine interface on which such systems depend.
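As a hedged illustration of that trial-and-error idea, the sketch below implements a tabular Q-learning update in plain Java, the simplest relative of the Deep Q-Network methods mentioned later in this article. The state and action counts, reward value and class name are placeholders, not any particular library's API.

```java
import java.util.Random;

// Minimal tabular Q-learning: reinforce actions that led to reward,
// gradually discarding those that did not. Sizes and rewards are placeholders.
public class QLearningSketch {
    static final int STATES = 10, ACTIONS = 4;
    static final double ALPHA = 0.1;   // learning rate
    static final double GAMMA = 0.9;   // discount factor
    static final double EPSILON = 0.1; // exploration probability

    static final double[][] q = new double[STATES][ACTIONS];
    static final Random rng = new Random(42);

    // Epsilon-greedy action selection: mostly exploit, occasionally explore
    static int chooseAction(int state) {
        if (rng.nextDouble() < EPSILON) return rng.nextInt(ACTIONS);
        int best = 0;
        for (int a = 1; a < ACTIONS; a++) if (q[state][a] > q[state][best]) best = a;
        return best;
    }

    // Core update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    static void update(int s, int a, double reward, int sNext) {
        double maxNext = q[sNext][0];
        for (int aNext = 1; aNext < ACTIONS; aNext++) maxNext = Math.max(maxNext, q[sNext][aNext]);
        q[s][a] += ALPHA * (reward + GAMMA * maxNext - q[s][a]);
    }

    public static void main(String[] args) {
        // One illustrative transition from a hypothetical environment
        int s = 0, a = chooseAction(s), sNext = 1;
        double reward = 1.0; // placeholder reward signal
        update(s, a, reward, sNext);
        System.out.println("Q(0," + a + ") after one update: " + q[0][a]);
    }
}
```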
Why Java in Your Data Science Workflows?
All languages are beautiful; their individual beauty often lies in the eye of the beholder. Since roughly 2010-15, the open source languages Python and R have dominated upstream data science; before that, the commercial language MATLAB led the way, and many game-changing early neural network algorithms were implemented in it. Views differ on how far Python and R extend into the enterprise stack. In research, R has a rich statistical library ecosystem, while key libraries like TensorFlow, PyTorch and Keras are accessible from Python, facilitated by the SciPy stack and Pandas. However, other languages are coming to the fore, including Java, C++ and .NET. Gartner machine learning guru Andriy Burkov eloquently writes:
"Some people working in data analysis think that there's something special about Python (or R, or, Scala).
They will tell you that you have to use one of those because otherwise, you will not get the best result. It's not true. The choice of language should be made based on two factors: 1) how well your final product will integrate with the existing ecosystem and 2) the availability of production-grade data analysis libraries for each language.
Currently, almost any popular language has one or more powerful libraries for data analysis. Java is an excellent example, where the development of everything hot is happening right now because of a multitude of existing JVM languages. C++ historically has a huge choice of implemented algorithms. Even proprietary ecosystems such as .NET today contain implementations of most of the state-of-the-art algorithms and learning paradigms. So, if someone tells you that only Python is the way to go, I would be skeptical and look for someone who embraces diversity."
Great advice. Two key points primarily from the Java perspective:
i) Data science algorithms "upstream", particularly for statistics, machine learning and deep learning methodologies (neural nets), hitherto the province of Python, R and MATLAB, are becoming more readily available across more languages. In Java, for example, the following frameworks are emerging:
- DeepLearning4J: a toolkit for building, training and deploying neural networks on the JVM, including image-processing use cases. RL4J extends it with reinforcement learning, including Markov Decision Process (MDP) and Deep Q-Network (DQN) methods.
- ND4J: key scientific computing libraries for the JVM, modeled on NumPy and core MATLAB and underpinning the deep learning capabilities above (a minimal usage sketch follows this list).
- Amazon Deep Java Library: Develop and deploy machine and deep learning models, drawing on MXNet, PyTorch and TensorFlow frameworks.
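To give a flavour of the NumPy-style API that ND4J brings to the JVM, here is a minimal sketch. It assumes an ND4J backend dependency (e.g. org.nd4j:nd4j-native-platform) on the classpath; the array values are arbitrary.

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

// Minimal ND4J sketch: NumPy-style n-dimensional arrays on the JVM.
// Values are arbitrary; assumes an ND4J backend such as nd4j-native-platform.
public class Nd4jSketch {
    public static void main(String[] args) {
        INDArray a = Nd4j.create(new double[][]{{1, 2}, {3, 4}});
        INDArray b = Nd4j.rand(2, 2);           // random 2x2 matrix

        INDArray product = a.mmul(b);           // matrix multiply
        INDArray scaled = a.mul(2.0).add(1.0);  // element-wise operations

        System.out.println("a * b =\n" + product);
        System.out.println("2a + 1 =\n" + scaled);
        System.out.println("mean of a = " + a.meanNumber());
    }
}
```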
These and other capabilities make Java accessible to developer-savvy scientific programmers.
Note that commercial "upstream" environments such as SAS, KNIME and RapidMiner offer data science platforms with strong Java foundations. MATLAB too has historically integrated well with Java for application development and API connectivity, a theme in Yair Altman's aging Java/MATLAB classic, Undocumented MATLAB. The MATLAB Production Server is one of several vehicles for deploying MATLAB algorithms into Java enterprise applications. In your Java code, you can define a Java interface to represent the deployed MATLAB function, instantiate a proxy object to communicate with the Production Server, and thus call the MATLAB-generated function.
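A hedged sketch of that proxy pattern is shown below. The interface, archive name, URL and function are hypothetical placeholders; the client classes come from the MATLAB Production Server Java client library.

```java
import com.mathworks.mps.client.MWClient;
import com.mathworks.mps.client.MWHttpClient;
import java.net.URL;

// Sketch of calling a deployed MATLAB function from Java via
// MATLAB Production Server. The interface, archive name and URL are
// hypothetical placeholders for whatever has actually been deployed.
public class MatlabProxySketch {

    // Java interface mirroring the deployed MATLAB function's signature
    public interface RiskFunctions {
        double[] scorePortfolio(double[] positions);
    }

    public static void main(String[] args) throws Exception {
        MWClient client = new MWHttpClient();
        try {
            RiskFunctions proxy = client.createProxy(
                    new URL("http://mps-host:9910/riskArchive"), RiskFunctions.class);
            double[] scores = proxy.scorePortfolio(new double[]{1.0, 2.5, -0.7});
            System.out.println("First score: " + scores[0]);
        } finally {
            client.close();
        }
    }
}
```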
You can also interface open source R code with Java, and deploy it alongside Java, in many ways, including via the Rserve package.
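For instance, a minimal sketch using the Rserve Java client (org.rosuda.REngine) might look like the following. It assumes an Rserve daemon is already running locally, and the R expression is purely illustrative.

```java
import org.rosuda.REngine.REXP;
import org.rosuda.REngine.Rserve.RConnection;

// Minimal Rserve sketch: evaluate R code from Java.
// Assumes an Rserve daemon is running on localhost (default port 6311);
// the R expression is illustrative only (it uses R's built-in "cars" dataset).
public class RserveSketch {
    public static void main(String[] args) throws Exception {
        RConnection c = new RConnection();
        try {
            REXP fit = c.eval("coef(lm(dist ~ speed, data = cars))");
            double[] coefficients = fit.asDoubles();
            System.out.println("Intercept: " + coefficients[0]
                    + ", slope: " + coefficients[1]);
        } finally {
            c.close();
        }
    }
}
```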
In short, there are increasing capabilities to code (production-ready-ish) data science algorithms in Java, and where that falls short, to call other languages from Java. Python (with NumPy, SciPy and Pandas), R and MATLAB will surely remain algorithmic domain leaders given their matrix algebra, technical computing and statistical foundations, but Java and other languages are increasingly compelling.
A quick nod to C++: remember that key "Python" libraries, including TensorFlow and PyTorch, have strong C++ foundations. Away from algorithms and toward data engineering, Pandas creator Wes McKinney, for example, has highlighted the relevance of C++ to the multi-platform Apache Arrow project.
ii) Data science enterprise architectures "downstream", particularly those focusing on secure data throughput, are often Java-based and/or underpinned by platforms and languages (e.g. Scala or Clojure) that run on the Java Virtual Machine (JVM), such as Apache Hadoop, Spark, Kafka, Cassandra and Elasticsearch.
When maintained with careful JVM tuning, or by swapping in a high-performance JVM, these already high-performance applications can become even more performant and reliable.
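As an illustrative, not prescriptive, example of what such tuning involves, a data-intensive JVM service might be launched with flags along these lines; the heap sizes, pause target and jar name are placeholders to adapt to the actual workload.

```
# Hypothetical launch command for a data-intensive JVM service.
# Flags are standard HotSpot options; values are placeholders.
java -Xms8g -Xmx8g \
     -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
     -XX:+AlwaysPreTouch \
     -jar data-pipeline-service.jar
```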
Java excels in distributed environments. Secure data handling, manipulation, transfer and connectivity are among its natural strengths, benefiting too from a coordinated strategy around security, enforced over the years by Sun Microsystems, Oracle and now the vibrant OpenJDK community. The cross-platform approach underpinned by the JVM, i.e. develop once, deploy anywhere, facilitates enterprise development. Key projects such as Project Panama will further ease access to native code, bringing compute-intensive, deep learning-friendly CUDA- and OpenCL-based libraries and GPU hardware within easier reach.
In conclusion, Java is prominent in enterprise architectures and is becoming increasingly versatile in "upstream" algorithmic capabilities that enable data science. It will operate in conjunction with Python, R, MATLAB, C++ and others, not instead of them, but the possibilities for using Java across all aspects of data science workflows are increasingly available.