Stars
Scrapy, a fast high-level web crawling & scraping framework for Python.
A collaborative note taking, wiki and documentation platform that scales. Built with Django and React.
The batteries-included, No-Code FinOps automation platform, with the AI you trust.
A Q&A platform for teams at any scale. Whether it's a community forum, help center, or knowledge management platform, you can always count on Apache Answer.
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
Open, Multi-modal Catalog for Data & AI
Dump the license list of packages installed with pip.
Streamlit — A faster way to build and share data apps.
PyGWalker: Turn your dataframe into an interactive UI for visual analysis
Automatically exported from code.google.com/p/passlib
MinIO is a high-performance, S3 compatible object store, open sourced under GNU AGPLv3 license.
Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.
API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation
Bonus materials, exercises, and example projects for our Python tutorials
A simplified, lightweight ETL Framework based on Apache Spark
A curated list of awesome Apache Spark packages and resources.
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡
Python Sorted Container Types: Sorted List, Sorted Dict, and Sorted Set
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Change data capture for a variety of databases. Please log issues at https://github.com/debezium/dbz/issues.
Native cross-platform MongoDB management tool
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Apache Superset is a Data Visualization and Data Exploration Platform
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Apache Spark - A unified analytics engine for large-scale data processing
A curated list of useful resources for gRPC

