A Collection of Machine Learning Tools

Best of ML Python: https://github.com/ml-tooling/best-of-ml-python

EDA made easy: Pandas-Profiling, Sweetviz, AutoViz, D-Tale

Classification metrics
  • Confusion Matrix
  • ROC AUC
  • Gini Coefficient
  • Gain and Lift Charts
  • KS Chart (Kolmogorov-Smirnov)
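
Several of the classification metrics above are closely related, and a small sketch may make the connections easier to see. The block below is a minimal pure-Python illustration, not a replacement for a library implementation. The function names (`confusion_counts`, `roc_auc`, `gini`, `ks_statistic`) are hypothetical, labels are assumed to be binary 0/1, and the Gini coefficient is taken in its common scoring-model sense of 2 * AUC - 1.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp) for binary labels in {0, 1}."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def _split_scores(y_true: Sequence[int], scores: Sequence[float]):
    """Split scores into those of positive and negative examples."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    return pos, neg


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n^2), fine for a sketch.
    """
    pos, neg = _split_scores(y_true, scores)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Gini coefficient, assumed here to mean 2 * AUC - 1."""
    return 2.0 * roc_auc(y_true, scores) - 1.0


def ks_statistic(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Largest gap between the score CDFs of the positive and negative classes."""
    pos, neg = _split_scores(y_true, scores)
    best = 0.0
    for thr in sorted(set(scores)):
        cdf_pos = sum(s <= thr for s in pos) / len(pos)
        cdf_neg = sum(s <= thr for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best
```

With the toy inputs `y_true = [0, 0, 1, 1]` and `scores = [0.1, 0.4, 0.35, 0.8]`, this sketch gives an AUC of 0.75, a Gini of 0.5, and a KS statistic of 0.5.
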
Regression metrics
  • MSE, RMSE
  • MAE
  • MAD
  • RAE (Relative Absolute Error)
  • RSE (Relative Squared Error)
  • R-squared and Adj R-squared
  • Analysis of Residuals
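
The regression metrics above share a few building blocks (summed absolute and squared errors, compared against a predict-the-mean baseline), which the following sketch tries to make explicit. It is a minimal pure-Python illustration under stated assumptions: the function name `regression_metrics` and the parameter `n_features` are hypothetical, RAE and RSE are normalized by the baseline that always predicts the mean of the targets, and MAD is left out because the abbreviation is read differently across sources.

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    n_features: int = 1,
) -> Dict[str, float]:
    """Compute common regression metrics from targets and predictions.

    n_features is the number of predictors used by the model; it only
    affects adjusted R-squared and must satisfy n > n_features + 1.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    if n <= n_features + 1:
        raise ValueError("adjusted R-squared needs n > n_features + 1")

    mean_y = sum(y_true) / n

    # Errors of the model.
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

    # Errors of the baseline that always predicts the mean of y_true.
    # Both are zero when the target is constant, in which case the
    # ratios below are undefined and a ZeroDivisionError is raised.
    abs_dev = sum(abs(t - mean_y) for t in y_true)
    sq_dev = sum((t - mean_y) ** 2 for t in y_true)

    mse = sq_err / n
    rse = sq_err / sq_dev
    r2 = 1.0 - rse
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": abs_err / n,
        "rae": abs_err / abs_dev,
        "rse": rse,
        "r2": r2,
        "adj_r2": adj_r2,
    }
```

Note that in this formulation R-squared is simply 1 - RSE, so a model that does no better than predicting the mean has RSE near 1 and R-squared near 0.
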
Competition
  • Feature Selection: eli5, lofo
  • Data loading: Faker, TensorFlow Datasets, datasets, pdfminer.six
  • Imbalanced data: imblearn
  • Parameter Optimization: Optuna, Keras Tuner, skopt (scikit-optimize), Hyperopt
  • AutoML: H2O, NNI (Neural Network Intelligence), auto-sklearn, AutoKeras, TPOT
  • Algo: LightGBM, XGBoost, CatBoost, Lazypredict
  • More efficient pandas: Vaex
  • Model Interpretability: LIME, SHAP, interpret, alibi
  • Missing value imputer: sklearn.impute.IterativeImputer
  • Training (Workflow & Experiment Tracking): TensorBoard, MLflow, TensorWatch, Data Version Control (DVC), Metaflow
NLP
  • NLP: Kashgari, FastNLP, TextBlob
  • QA: NeuralQA
  • NER: NeuroNER
  • Neural Relation Extraction (NRE): OpenNRE
  • Labeling: doccano
  • Seq2Seq: fairseq
CV
  • Image: Pillow, torchvision, scikit-image
  • Image model: PyTorch Image Models, GluonCV
  • Image label: labelImg
  • Object detection: detectron2, mmdetection
  • Face Recognition: face_recognition, facenet_pytorch
  • Segmentation: segmentation_models
  • Image data augmentation: imgaug, Albumentations, Augmentor
  • Finding duplicate images: imagededup
  • Explainability: cv2.saliency
  • Faster image loading: libjpeg-turbo, PyVips
Time Series
  • Time Series Feature extraction: tsfresh
  • Time series smoothing and outlier detection: tsmoothie
  • Time Series forecasting: Prophet and NeuralProphet, sktime, pytorch-forecasting, pmdarima
  • Better datetime: python-dateutil
  • Generalized framework: Kats
  • Markov models: Deeptime
OCR
  • OCR: Tesseract, EasyOCR, PaddleOCR
Medical
  • Medical: MNE-Python, Nilearn, Lifelines
Recommendation System
  • Building and analysis: recommenders, torchrec, TensorFlow Recommenders, pyspark.mllib.recommendation, surprise
  • Collaborative Filtering: Implicit
  • Factorization Machine: LightFM
Face detection
  • Facenet, OpenFace, VGG-Face, DeepFace, Dlib
Visualization
  • Data visualization: Plotly, Bokeh, HoloViews, Datashader, pydantic, schema
  • Plotting architecture: PlotNeuralNet
  • High-dimensional data: UMAP
Production
  • Training pipeline: Kubeflow, Airflow, Prefect, Metaflow, TensorFlow Extended (TFX)
  • Data Versioning: DVC
  • Data Validation: TensorFlow Data Validation (TFDV), datatest
  • Distributed: Ray, PySpark, DeepSpeed
  • Model Monitoring: Seldon, MLWatcher
  • Model registry: MLflow
  • Explainability: SHAP
  • Experiment Monitoring: TensorBoard, Weights & Biases
  • Model timing and profiling: PyTorch Profiler, TensorFlow Profiler
  • Code Review: ReviewNB
  • API: Flask, FastAPI
  • Large scale: PySpark, TensorFlowOnSpark, Horovod, BigDL
  • Python SQL: BlazingSQL, dask-sql
  • CI/CD: GoCD, AutoRABIT
  • C++: Dlib, mlpack
  • Labeling: Label Studio
  • Model testing: checklist (NLP bias)
Low-code
  • PyCaret
Probabilistic Programming
  • PyMC
  • Pyro