Releases: deepset-ai/haystack
v2.10.0-rc1
Release Notes
v2.10.0-rc1
Highlights
We are introducing the `AsyncPipeline`, which supports running pipelines asynchronously. It schedules components concurrently whenever possible, which can lead to major speed improvements for pipelines that can run workloads in parallel.
Major refactoring of Pipeline.run() to fix multiple bugs. We moved from a mostly graph-based to a dynamic, dataflow-driven execution logic. While most pipelines should remain unaffected, we recommend carefully checking your pipeline executions to ensure their output hasn't changed.
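To illustrate why concurrent scheduling helps, here is a minimal sketch using only `asyncio`. It does not use the `AsyncPipeline` API; `run_component`, the component names, and the delays are hypothetical stand-ins for I/O-bound components.

```python
import asyncio


async def run_component(name: str, delay: float) -> str:
    # Hypothetical stand-in for an I/O-bound component (retriever, LLM call, ...).
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_branches() -> list[str]:
    # The two retrievers do not depend on each other, so they are awaited
    # together; the joiner needs both results and therefore runs afterwards.
    retrieved = await asyncio.gather(
        run_component("bm25_retriever", 0.05),
        run_component("embedding_retriever", 0.05),
    )
    joined = await run_component("joiner", 0.01)
    return [*retrieved, joined]


results = asyncio.run(run_branches())
print(results)
```

In this sketch the two independent branches take roughly the time of the slower one rather than the sum of both, which is the kind of gain the highlight describes.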
Upgrade Notes
- The DOCXToDocument converter now returns a Document object with DOCX metadata stored in the meta field as a dictionary under the key docx. Previously, the metadata was represented as a DOCXMetadata dataclass. This change does not impact reading from or writing to a Document Store.
- Removed the deprecated NLTKDocumentSplitter; its functionality is now supported by the DocumentSplitter.
- The deprecated FUNCTION role has been removed from the ChatRole enum. Use TOOL instead. The deprecated class method ChatMessage.from_function has been removed. Use ChatMessage.from_tool instead.
New Features
-
Added a new component ListJoiner which joins lists of values from different components to a single list.
-
Introduced the OpenAPIConnector component, enabling direct invocation of REST endpoints as specified in an OpenAPI specification. This component is designed for direct REST endpoint invocation without LLM-generated payloads; users need to pass the run parameters explicitly.
Example:
```python
from haystack.utils import Secret
from haystack.components.connectors.openapi import OpenAPIConnector

connector = OpenAPIConnector(
    openapi_spec="https://bit.ly/serperdev_openapi",
    credentials=Secret.from_env_var("SERPERDEV_API_KEY"),
)
response = connector.run(
    operation_id="search",
    parameters={"q": "Who was Nikola Tesla?"}
)
```
-
Added a new component LLMMetadataExtractor, which can be used in an indexing pipeline to extract metadata from documents based on a user-provided prompt and return the documents with the LLM output stored in their metadata field.
-
Add support for Tools in the Azure OpenAI Chat Generator.
-
Introduced CSVDocumentCleaner component for cleaning CSV documents.
- Removes empty rows and columns, while preserving specified ignored rows and columns.
- Customizable number of rows and columns to ignore during processing.
-
Introducing CSVDocumentSplitter: The CSVDocumentSplitter splits CSV documents into structured sub-tables by recursively splitting by empty rows and columns larger than a specified threshold. This is particularly useful when converting Excel files which can often have multiple tables within one sheet.
-
Drawing pipelines, i.e., calls to draw() or show(), can now be done using a custom Mermaid server and additional parameters. This allows for more flexibility in how pipelines are rendered. See Mermaid.ink's [documentation](https://github.com/jihchi/mermaid.ink) for more information on how to set up a custom server.
-
Added a new AsyncPipeline implementation that allows pipelines to be executed from async code, supporting concurrent scheduling of pipeline components for faster processing.
-
Adds tooling support to HuggingFaceLocalChatGenerator
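The row-based half of the splitting described for CSVDocumentSplitter above can be sketched with the standard library. This is not Haystack's implementation: `split_on_empty_rows` and its `threshold` parameter are hypothetical, and the sketch assumes a split is triggered by at least `threshold` consecutive empty rows.

```python
import csv
import io


def split_on_empty_rows(csv_text: str, threshold: int = 1) -> list[list[list[str]]]:
    """Split CSV rows into sub-tables at runs of empty rows (illustrative only)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    tables, current, empty_run = [], [], 0
    for row in rows:
        if all(cell.strip() == "" for cell in row):
            empty_run += 1
            continue
        # A long enough run of empty rows closes the current sub-table.
        # Shorter runs are simply dropped in this sketch.
        if current and empty_run >= threshold:
            tables.append(current)
            current = []
        empty_run = 0
        current.append(row)
    if current:
        tables.append(current)
    return tables


sample = "a,b\n1,2\n,\n,\nc,d\n3,4\n"
tables = split_on_empty_rows(sample, threshold=2)
print(tables)
```

Splitting by empty columns would apply the same idea to the transposed table, and recursing on each sub-table gives the structured result the note describes.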
Enhancement Notes
- Enhanced SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder to accept an additional parameter, which is passed directly to the underlying SentenceTransformer.encode method for greater flexibility in embedding customization.
- Added completion_start_time metadata to track time-to-first-token (TTFT) in streaming responses from Hugging Face API and OpenAI (Azure).
- Enhancements to Date Filtering in MetadataRouter
- Improved date parsing in filter utilities by introducing _parse_date, which first attempts datetime.fromisoformat(value) for backward compatibility and then falls back to dateutil.parser.parse() for broader ISO 8601 support.
- Resolved a common issue where comparing naive and timezone-aware datetimes resulted in TypeError. Added _ensure_both_dates_naive_or_aware, which ensures both datetimes are either naive or aware. If one is missing a timezone, it is assigned the timezone of the other for consistency.
- When Pipeline.from_dict receives an invalid type (e.g. empty string), an informative PipelineError is now raised.
- Add jsonschema library as a core dependency. It is used in Tool and JsonSchemaValidator.
- Streaming callback run param support for HF chat generators.
- For the CSVDocumentCleaner, added remove_empty_rows & remove_empty_columns to optionally remove rows and columns. Also added keep_id to optionally allow for keeping the original document ID.
- Enhanced OpenAPIServiceConnector to support and be compatible with the new ChatMessage format.
- Updated Document's meta data after initializing the Document in DocumentSplitter as requested in issue #8741
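The naive versus timezone-aware issue mentioned in the date filtering notes above can be reproduced and worked around with the standard library alone. The helper below is a hypothetical stand-in written for illustration, not the `_ensure_both_dates_naive_or_aware` function itself.

```python
from datetime import datetime


def ensure_both_naive_or_aware(a: datetime, b: datetime) -> tuple[datetime, datetime]:
    """If exactly one datetime lacks a timezone, give it the other's timezone."""
    if a.tzinfo is None and b.tzinfo is not None:
        a = a.replace(tzinfo=b.tzinfo)
    elif a.tzinfo is not None and b.tzinfo is None:
        b = b.replace(tzinfo=a.tzinfo)
    return a, b


naive = datetime.fromisoformat("2025-01-01T12:00:00")
aware = datetime.fromisoformat("2025-01-01T13:00:00+00:00")

# Comparing `naive < aware` directly raises TypeError; after harmonizing it works.
left, right = ensure_both_naive_or_aware(naive, aware)
is_earlier = left < right
print(is_earlier)
```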
Deprecation Notes
- The ExtractedTableAnswer dataclass and the dataframe field in the Document dataclass are deprecated and will be removed in Haystack 2.11.0. Check out the GitHub discussion for motivation and details: https://github.com/deepset-ai/haystack/discussions/8688
Bug Fixes
- Fixes a bug that causes pyright type checker to fail for all component objects.
- Haystack pipelines with Mermaid graphs are now compressed to reduce the size of the encoded base64 and avoid HTTP 400 errors when the graph is too large.
- The DOCXToDocument component now skips comment blocks in DOCX files that previously caused errors.
- Callable deserialization now works for all fully qualified import paths.
- Fix error messages for Document Classifier components that suggested using nonexistent components for text classification.
- Fixed JSONConverter to properly skip converting JSON files that are not utf-8 encoded.
- Pipeline execution bugs fixed by the Pipeline.run() refactoring:
- acyclic pipelines with multiple lazy variadic components not running all components
- cyclic pipelines not passing intermediate outputs to components outside the cycle
- cyclic pipelines with two or more optional or greedy variadic edges showing unexpected execution behavior
- cyclic pipelines with two cycles sharing an edge raising errors
- Updated the PDFMinerToDocument convert function to insert double new lines between container_text elements so that passages can later be split by DocumentSplitter.
- In the Hugging Face API embedders, the InferenceClient.feature_extraction method is now used instead of InferenceClient.post to compute embeddings. This ensures a more robust and future-proof implementation.
- Improved OpenAIChatGenerator streaming response tool call processing: The logic now scans all chunks to correctly identify the first chunk with tool calls, ensuring accurate payload construction and preventing errors when tool call data isn’t confined to the initial chunk.
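For the callable deserialization fix listed above, the general technique of resolving a fully qualified import path can be sketched with `importlib`. The function names below are hypothetical and the logic is an assumption about how such a resolver can work, not Haystack's code.

```python
import importlib
import json
from typing import Callable


def serialize_callable(fn: Callable) -> str:
    # Store a callable as "<module>.<qualified name>".
    return f"{fn.__module__}.{fn.__qualname__}"


def deserialize_callable(path: str) -> Callable:
    parts = path.split(".")
    # Try the longest importable module prefix first, then walk the
    # remaining parts as attributes (covers nested modules and class members).
    for i in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ImportError:
            continue
        for attr in parts[i:]:
            obj = getattr(obj, attr)
        if not callable(obj):
            raise TypeError(f"{path} is not callable")
        return obj
    raise ImportError(f"Could not import {path}")


restored = deserialize_callable(serialize_callable(json.dumps))
print(restored is json.dumps)
```

Trying module prefixes from longest to shortest is what lets paths such as `package.module.Class.method` resolve as well as plain `module.function`.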
v2.9.0
⭐️ Highlights
Tool Calling Support
We are introducing the `Tool`, a simple and unified abstraction for representing tools in Haystack, and the `ToolInvoker`, which executes tool calls prepared by LLMs. These features make it easy to integrate tool calling into your Haystack pipelines, enabling seamless interaction with tools when used with components like `OpenAIChatGenerator` and `HuggingFaceAPIChatGenerator`. Here's how you can use them:
```python
from haystack import Pipeline
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.tools.tool_invoker import ToolInvoker
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool


def dummy_weather_function(city: str):
    return f"The weather in {city} is 20 degrees."


tool = Tool(
    name="weather_tool",
    description="A tool to get the weather",
    function=dummy_weather_function,
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    }
)

pipeline = Pipeline()
pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini", tools=[tool]))
pipeline.add_component("tool_invoker", ToolInvoker(tools=[tool]))
pipeline.connect("llm.replies", "tool_invoker.messages")

message = ChatMessage.from_user("How is the weather in Berlin today?")
result = pipeline.run({"llm": {"messages": [message]}})
```
Use Components as Tools
As an abstraction of `Tool`, `ComponentTool` allows LLMs to interact directly with components like web search, document processing, or custom user components. It simplifies schema generation and type conversion, making it easy to expose complex component functionality to LLMs.
```python
from haystack.components.websearch import SerperDevWebSearch
from haystack.tools import ComponentTool
from haystack.utils import Secret

# Create a tool from the component
tool = ComponentTool(
    component=SerperDevWebSearch(api_key=Secret.from_env_var("SERPERDEV_API_KEY"), top_k=3),
    name="web_search",  # Optional: defaults to "serper_dev_web_search"
    description="Search the web for current information on any topic"  # Optional: defaults to component docstring
)
```
New Splitting Method: RecursiveDocumentSplitter
`RecursiveDocumentSplitter` introduces a smarter way to split text. It uses a set of separators to divide text recursively, starting with the first separator. If chunks are still larger than the specified size, the splitter moves to the next separator in the list. This approach ensures efficient and granular text splitting for improved processing.
```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
doc_chunks = splitter.run([Document(content="...")])
```
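The recursive idea itself can be sketched in a few lines of plain Python. This is a simplified illustration, not the component's implementation: `recursive_split` is a hypothetical function, it measures length in characters, drops the separators, and does not merge small pieces or apply overlap.

```python
def recursive_split(text: str, separators: list[str], max_len: int) -> list[str]:
    """Split on the first separator; recurse with the rest on oversized pieces."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separators left: fall back to fixed-size slices.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        if not piece:
            continue
        chunks.extend(recursive_split(piece, rest, max_len))
    return chunks


chunks = recursive_split("one two. three four five\n\nsix", ["\n\n", ".", " "], max_len=10)
print(chunks)
```

Here the paragraph break is tried first, the period only on the paragraph that is still too long, and the space only on the sentence that is still too long.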
⚠️ Refactored `ChatMessage` dataclass
The `ChatMessage` dataclass has been refactored to improve flexibility and compatibility. As part of this update, the `content` attribute has been removed and replaced with a new `text` property for accessing the ChatMessage's textual value. This change ensures future-proofing and better support for features like tool calls and their results. For details on the new API and migration steps, see the ChatMessage documentation. If you have any questions about this refactoring, feel free to let us know in this GitHub discussion.
⬆️ Upgrade Notes
- The refactoring of the `ChatMessage` data class includes some breaking changes involving `ChatMessage` creation and accessing attributes. If you have a `Pipeline` containing a `ChatPromptBuilder`, serialized with `haystack-ai =< 2.9.0`, deserialization may break. For detailed information about the changes and how to migrate, see the ChatMessage documentation.
- Removed the deprecated `converter` init argument from `PyPDFToDocument`. Use other init arguments instead, or create a custom component.
- The `SentenceWindowRetriever` output key `context_documents` now outputs a `List[Document]` containing the retrieved documents and the context windows ordered by `split_idx_start`.
- Update default value of `store_full_path` to `False` in converters.
🚀 New Features
- Introduced the `ComponentTool`, a new tool that wraps Haystack components, allowing them to be utilized as tools for LLMs (various ChatGenerators). This `ComponentTool` supports automatic tool schema generation, input type conversion, and offers support for components with run methods that have input types:
  - Basic types (str, int, float, bool, dict)
  - Dataclasses (both simple and nested structures)
  - Lists of basic types (e.g., `List[str]`)
  - Lists of dataclasses (e.g., `List[Document]`)
  - Parameters with mixed types (e.g., `List[Document]`, str etc.)

Example usage:
```python
from haystack import component, Pipeline
from haystack.tools import ComponentTool
from haystack.components.websearch import SerperDevWebSearch
from haystack.utils import Secret
from haystack.components.tools.tool_invoker import ToolInvoker
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

# Create a SerperDev search component
search = SerperDevWebSearch(api_key=Secret.from_env_var("SERPERDEV_API_KEY"), top_k=3)

# Create a tool from the component
tool = ComponentTool(
    component=search,
    name="web_search",  # Optional: defaults to "serper_dev_web_search"
    description="Search the web for current information on any topic"  # Optional: defaults to component docstring
)

# Create pipeline with OpenAIChatGenerator and ToolInvoker
pipeline = Pipeline()
pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini", tools=[tool]))
pipeline.add_component("tool_invoker", ToolInvoker(tools=[tool]))

# Connect components
pipeline.connect("llm.replies", "tool_invoker.messages")

message = ChatMessage.from_user("Use the web search tool to find information about Nikola Tesla")

# Run pipeline
result = pipeline.run({"llm": {"messages": [message]}})

print(result)
```
- Add `XLSXToDocument` converter that loads an Excel file using Pandas + openpyxl and by default converts each sheet into a separate `Document` in CSV format.
- Added a new `store_full_path` parameter to the `__init__` methods of `PyPDFToDocument` and `AzureOCRDocumentConverter`. The default value is `True`, which stores the full file path in the metadata of the output documents. When set to `False`, only the file name is stored.
- Add a new experimental component `ToolInvoker`. This component invokes tools based on tool calls prepared by Language Models and returns the results as a list of `ChatMessage` objects with tool role.
- Adding a `RecursiveSplitter`, which uses a set of separators to split text recursively. It attempts to divide the text using the first separator, and if the resulting chunks are still larger than the specified size, it moves to the next separator in the list.
- Added a `create_tool_from_function` function to create a `Tool` instance from a function, with automatic generation of name, description and parameters. Added a `tool` decorator to achieve the same result.
- Add support for Tools in the Hugging Face API Chat Generator.
- Changed the `ChatMessage` dataclass to support different types of content, including tool calls, and tool call results.
- Add support for Tools in the OpenAI Chat Generator.
- Added a new `Tool` dataclass to represent a tool for which Language Models can prepare calls.
- Added the component `StringJoiner` to join strings from different components to a list of strings.
⚡️ Enhancement Notes
- Added `default_headers` parameter to `AzureOpenAIDocumentEmbedder` and `AzureOpenAITextEmbedder`.
- Add `token` argument to `NamedEntityExtractor` to allow usage of private Hugging Face models.
- Add the `from_openai_dict_format` class method to the `ChatMessage` class. It allows you to create a `ChatMessage` from a dictionary in the format that OpenAI's Chat API expects.
- Add a testing job to check that all packages can be imported successfully. This should help detect several issues, such as forgetting to use a forward reference for a type hint coming from a lazy import.
- `DocumentJoiner` methods `_concatenate()` and `_distribution_based_rank_fusion()` were converted to static methods.
- Improve serialization and deserialization of callables. We now allow serialization of class methods and static methods and explicitly prohibit serialization of instance methods, lambdas, and nested functions.
- Added new initialization parameters to the `PyPDFToDocument` component to customize the text extraction process from PDF files.
- Reorganized the document store test suite to isolate `dataframe` filter tests. This change prepares for potential future deprecation of the Document class's `dataframe` field.
- Move `Tool` to a new dedicated `tools` package. Refactor `Tool` serialization and deserialization to make it more flexible and include type information.
- The `NLTKDocumentSplitter` was merged into the `DocumentSplitter` which now provides the same functionality as the `NLTKDocumentSplitter`. The `split_by="sentence"` now uses a custom sentence boundary detection based on the `nltk` library. The previous `sentence` behaviour can still be achieved by `split_by="period"`.
- Improved deserialization of callables by using `importlib` instead of `sys.modules`. This change allows importing local functions and classes that are not in `sys.modules` when deserializing callable.
- Change `OpenAIDocumentEmbedder` to keep running if a batch fails embedding. Now OpenAI returns an error we log that error and keep processing following batc...
v2.9.0-rc1
⭐️ Highlights
Tool Calling Support
We are introducing the `Tool`, a simple and unified abstraction for representing tools in Haystack, and the `ToolInvoker`, which executes tool calls prepared by LLMs. These features make it easy to integrate tool calling into your Haystack pipelines, enabling seamless interaction with tools when used with components like `OpenAIChatGenerator` and `HuggingFaceAPIChatGenerator`. Here's how you can use them:
```python
from haystack import Pipeline
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.tools.tool_invoker import ToolInvoker
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool


def dummy_weather_function(city: str):
    return f"The weather in {city} is 20 degrees."


tool = Tool(
    name="weather_tool",
    description="A tool to get the weather",
    function=dummy_weather_function,
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    }
)

pipeline = Pipeline()
pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini", tools=[tool]))
pipeline.add_component("tool_invoker", ToolInvoker(tools=[tool]))
pipeline.connect("llm.replies", "tool_invoker.messages")

message = ChatMessage.from_user("How is the weather in Berlin today?")
result = pipeline.run({"llm": {"messages": [message]}})
```
Use Components as Tools
As an abstraction of `Tool`, `ComponentTool` allows LLMs to interact directly with components like web search, document processing, or custom user components. It simplifies schema generation and type conversion, making it easy to expose complex component functionality to LLMs.
```python
from haystack.components.websearch import SerperDevWebSearch
from haystack.tools import ComponentTool
from haystack.utils import Secret

# Create a tool from the component
tool = ComponentTool(
    component=SerperDevWebSearch(api_key=Secret.from_env_var("SERPERDEV_API_KEY"), top_k=3),
    name="web_search",  # Optional: defaults to "serper_dev_web_search"
    description="Search the web for current information on any topic"  # Optional: defaults to component docstring
)
```
New Splitting Method: RecursiveDocumentSplitter
`RecursiveDocumentSplitter` introduces a smarter way to split text. It uses a set of separators to divide text recursively, starting with the first separator. If chunks are still larger than the specified size, the splitter moves to the next separator in the list. This approach ensures efficient and granular text splitting for improved processing.
```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
doc_chunks = splitter.run([Document(content="...")])
```
⚠️ Refactored `ChatMessage` dataclass
The `ChatMessage` dataclass has been refactored to improve flexibility and compatibility. As part of this update, the `content` attribute has been removed and replaced with a new `text` property for accessing the ChatMessage's textual value. This change ensures future-proofing and better support for features like tool calls and their results. For details on the new API and migration steps, see the ChatMessage documentation.
⬆️ Upgrade Notes
- The refactoring of the `ChatMessage` data class includes some breaking changes involving `ChatMessage` creation and accessing attributes. If you have a `Pipeline` containing a `ChatPromptBuilder`, serialized with `haystack-ai =< 2.9.0`, deserialization may break. For detailed information about the changes and how to migrate, see the ChatMessage documentation.
- Remove deprecated 'converter' init argument from `PyPDFToDocument`. Use other init arguments instead, or create a custom component.
- The `SentenceWindowRetriever` output key `context_documents` now outputs a `List[Document]` containing the retrieved documents and the context windows ordered by `split_idx_start`.
- Update default value of `store_full_path` to `False` in converters.
- Remove 'is_greedy' deprecated argument from `@component` decorator. Change the `Variadic` input of your Component to `GreedyVariadic` instead.
🚀 New Features
- Introduced the `ComponentTool`, a new tool that wraps Haystack components, allowing them to be utilized as tools for LLMs (various ChatGenerators). This `ComponentTool` supports automatic tool schema generation, input type conversion, and offers support for components with run methods that have input types:
  - Basic types (str, int, float, bool, dict)
  - Dataclasses (both simple and nested structures)
  - Lists of basic types (e.g., `List[str]`)
  - Lists of dataclasses (e.g., `List[Document]`)
  - Parameters with mixed types (e.g., `List[Document]`, str etc.)

Example usage:
```python
from haystack import component, Pipeline
from haystack.tools import ComponentTool
from haystack.components.websearch import SerperDevWebSearch
from haystack.utils import Secret
from haystack.components.tools.tool_invoker import ToolInvoker
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

# Create a SerperDev search component
search = SerperDevWebSearch(api_key=Secret.from_env_var("SERPERDEV_API_KEY"), top_k=3)

# Create a tool from the component
tool = ComponentTool(
    component=search,
    name="web_search",  # Optional: defaults to "serper_dev_web_search"
    description="Search the web for current information on any topic"  # Optional: defaults to component docstring
)

# Create pipeline with OpenAIChatGenerator and ToolInvoker
pipeline = Pipeline()
pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini", tools=[tool]))
pipeline.add_component("tool_invoker", ToolInvoker(tools=[tool]))

# Connect components
pipeline.connect("llm.replies", "tool_invoker.messages")

message = ChatMessage.from_user("Use the web search tool to find information about Nikola Tesla")

# Run pipeline
result = pipeline.run({"llm": {"messages": [message]}})

print(result)
```
- Add `XLSXToDocument` converter that loads an Excel file using Pandas + openpyxl and by default converts each sheet into a separate `Document` in CSV format.
- Added a new `store_full_path` parameter to the `__init__` methods of `PyPDFToDocument` and `AzureOCRDocumentConverter`. The default value is `True`, which stores the full file path in the metadata of the output documents. When set to `False`, only the file name is stored.
- Add a new experimental component `ToolInvoker`. This component invokes tools based on tool calls prepared by Language Models and returns the results as a list of `ChatMessage` objects with tool role.
- Adding a `RecursiveSplitter`, which uses a set of separators to split text recursively. It attempts to divide the text using the first separator, and if the resulting chunks are still larger than the specified size, it moves to the next separator in the list.
- Added a `create_tool_from_function` function to create a `Tool` instance from a function, with automatic generation of name, description and parameters. Added a `tool` decorator to achieve the same result.
- Add support for Tools in the Hugging Face API Chat Generator.
- Changed the `ChatMessage` dataclass to support different types of content, including tool calls, and tool call results.
- Add support for Tools in the OpenAI Chat Generator.
- Added a new `Tool` dataclass to represent a tool for which Language Models can prepare calls.
- Add warning logs to the PDFMinerToDocument and PyPDFToDocument to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to the DocumentSplitter that warns the user that empty Documents are skipped. This behavior was already occurring, but now it's clearer through logs that this is happening.
- We have added a new MetaFieldGroupingRanker component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.
- Added a new `store_full_path` parameter to the `__init__` methods of `JSONConverter`, `MarkdownToDocument`, `PDFMinerToDocument`, `PPTXToDocument`, `TikaDocumentConverter` and `TextFileToDocument`. The default value is `True`, which stores the full file path in the metadata of the output documents. When set to `False`, only the file name is stored.
- Added a new `store_full_path` parameter to the `__init__` method of `CSVToDocument`, `DOCXToDocument`, and `HTMLToDocument`. The default value is `True`, which stores the full file path in the metadata of the output documents. When set to `False`, only the file name is stored.
- Added component StringJoiner to join strings from different components to a list of strings.
- When making function calls via OpenAPI, allow both switching SSL verification off and specifying a certificate authority to use for it.
- Add Time-to-First-Token (TTFT) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.
- Added a new option to the required_variables parameter of the `PromptBuilder` and `ChatPromptBuilder`. By passing `required_variables="*"` you can automatically set all variables in the prompt to be required.
⚡️ Enhancement Notes
- Added `default_headers` parameter to `AzureOpenAIDocumentEmbedder` and `AzureOpenAITextEmbedder`.
- Add `token` argument to `NamedEntityExtractor` to allow usage of private Hugging Face models.
- Across Haystack codebase, we have replaced the use of `ChatMessage` dataclass constructor with specific class met...
v2.8.1
Release Notes
v2.8.1
Bug Fixes
- Pin OpenAI client to >=1.56.1 to avoid issues related to changes in the httpx library.
- PyPDFToDocument now creates documents with id based on converted text and meta data. Before it didn't take the meta data into account.
- Fixes issues with deserialization of components in multi-threaded environments.
v2.8.1-rc3
Release Notes
v2.8.1-rc3
Bug Fixes
- PyPDFToDocument now creates documents with id based on converted text and meta data. Before it didn't take the meta data into account.
v2.8.1-rc2
Bug Fixes
- Fixes issues with deserialization of components in multi-threaded environments.
v2.8.1-rc1
Bug Fixes
- Pin OpenAI client to >=1.56.1 to avoid issues related to changes in the httpx library.
v2.8.1-rc2
Release Notes
v2.8.1-rc2
Bug Fixes
- Fixes issues with deserialization of components in multi-threaded environments.
v2.8.1-rc1
Bug Fixes
- Pin OpenAI client to >=1.56.1 to avoid issues related to changes in the httpx library.
v2.8.1-rc1
Release Notes
v2.8.1-rc1
Bug Fixes
- Pin OpenAI client to >=1.56.1 to avoid issues related to changes in the httpx library.
v2.8.0
Release Notes
⬆️ Upgrade Notes
- Remove `is_greedy` deprecated argument from `@component` decorator. Change the `Variadic` input of your Component to `GreedyVariadic` instead.
🚀 New Features
- We've added a new `DALLEImageGenerator` component, bringing image generation with OpenAI's DALL-E to Haystack.
  - Easy to use: just a few lines of code to get started:
```python
from haystack.components.generators import DALLEImageGenerator

image_generator = DALLEImageGenerator()
response = image_generator.run("Show me a picture of a black cat.")
print(response)
```
- Add warning logs to the `PDFMinerToDocument` and `PyPDFToDocument` to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to the `DocumentSplitter` that warns the user that empty Documents are skipped. This behavior was already occurring, but now it's clearer through logs that this is happening.
- We have added a new `MetaFieldGroupingRanker` component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.
- Added a new `store_full_path` parameter to the `__init__` methods of the following converters: `JSONConverter`, `CSVToDocument`, `DOCXToDocument`, `HTMLToDocument`, `MarkdownToDocument`, `PDFMinerToDocument`, `PPTXToDocument`, `TikaDocumentConverter`, `PyPDFToDocument`, `AzureOCRDocumentConverter` and `TextFileToDocument`. The default value is `True`, which stores the full file path in the metadata of the output documents. When set to `False`, only the file name is stored.
- When making function calls via `OpenAPI`, allow both switching SSL verification off and specifying a certificate authority to use for it.
- Add TTFT (Time-to-First-Token) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.
- Added a new option to the required_variables parameter of the `PromptBuilder` and `ChatPromptBuilder`. By passing `required_variables="*"` you can automatically set all variables in the prompt to be required.
⚡️ Enhancement Notes
- Across Haystack codebase, we have replaced the use of `ChatMessage` data class constructor with specific class methods (`ChatMessage.from_user`, `ChatMessage.from_assistant`, etc.).
- Added the Maximum Margin Relevance (MMR) strategy to the `SentenceTransformersDiversityRanker`. MMR scores are calculated for each document based on their relevance to the query and diversity from already selected documents.
- Introduces optional parameters in the `ConditionalRouter` component, enabling default/fallback routing behavior when certain inputs are not provided at runtime. This enhancement allows for more flexible pipeline configurations with graceful handling of missing parameters.
- Added split by line to `DocumentSplitter`, which will split the document at `\n`.
- Change `OpenAIDocumentEmbedder` to keep running if a batch fails embedding. If OpenAI returns an error, we now log it and keep processing the following batches.
- Added new initialization parameters to the `PyPDFToDocument` component to customize the text extraction process from PDF files.
- Replace usage of `ChatMessage.content` with `ChatMessage.text` across the codebase. This is done in preparation for the removal of `content` in Haystack 2.9.0.
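The MMR strategy mentioned in the enhancement notes above trades relevance against redundancy. The following is a minimal pure-Python sketch of that general technique, not the ranker's implementation; `mmr_rank`, `cosine`, and the `lambda_` weight are hypothetical names, and the vectors are made-up toy embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rank(query_vec, doc_vecs, top_k, lambda_=0.5):
    """Greedily pick documents that are relevant but not redundant."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < top_k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to anything already selected.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]  # docs 0 and 1 are duplicates
by_relevance = mmr_rank(query, docs, top_k=2, lambda_=1.0)
diverse = mmr_rank(query, docs, top_k=2, lambda_=0.3)
print(by_relevance, diverse)
```

With `lambda_=1.0` only relevance counts, so the duplicate is picked second; lowering `lambda_` penalizes the duplicate and the less similar document is picked instead.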
⚠️ Deprecation Notes
- The default value of the `store_full_path` parameter in converters will change to `False` in Haystack 2.9.0 to enhance privacy.
- In Haystack 2.9.0, the `ChatMessage` data class will be refactored to make it more flexible and future-proof. As part of this change, the `content` attribute will be removed. A new `text` property has been introduced to provide access to the textual value of the `ChatMessage`. To ensure a smooth transition, start using the `text` property now in place of `content`.
- The `converter` parameter in the `PyPDFToDocument` component is deprecated and will be removed in Haystack 2.9.0. For in-depth customization of the conversion process, consider implementing a custom component. Additional high-level customization options will be added in the future.
- The output of `context_documents` in `SentenceWindowRetriever` will change in the next release. Instead of a List[List[Document]], the output will be a List[Document], where the documents are ordered by `split_idx_start`.
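For code that still handles the nested `context_documents` shape, the change described in the last deprecation note can be pictured as flattening and sorting. This is a sketch under one plausible reading of the note; plain dicts stand in for `Document` objects and carry only the fields needed here.

```python
# Hypothetical stand-ins for Documents: dicts with just the relevant fields.
old_output = [
    [{"content": "B", "split_idx_start": 10}, {"content": "A", "split_idx_start": 0}],
    [{"content": "C", "split_idx_start": 20}],
]

# Flatten List[List[...]] into a single list, then order by split_idx_start.
flat = [doc for window in old_output for doc in window]
flat.sort(key=lambda d: d["split_idx_start"])

contents = [d["content"] for d in flat]
print(contents)
```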
🐛 Bug Fixes
- Fix `DocumentCleaner` not preserving all `Document` fields when run.
- Fix `DocumentJoiner` failing when run with an empty list of Documents.
- For the `NLTKDocumentSplitter` we are updating how chunks are made when splitting by word and sentence boundary is respected. Namely, to avoid fully subsuming the previous chunk into the next one, we ignore the first sentence from that chunk when calculating sentence overlap, i.e. we want to avoid cases of `Doc1 = [s1, s2], Doc2 = [s1, s2, s3]`.
- Finished adding function support for the `NLTKDocumentSplitter` by updating the `_split_into_units` function and adding the `splitting_function` init parameter.
- Add a specific `to_dict` method to the `NLTKDocumentSplitter` to overwrite the underlying one from `DocumentSplitter`. This is needed to properly save the settings of the component to yaml.
- Fix `OpenAIChatGenerator` and `OpenAIGenerator` crashing when using a streaming_callback and `generation_kwargs` contain `{"stream_options": {"include_usage": True}}`.
- Fix tracing `Pipeline` with cycles to correctly track components execution.
- When meta is passed into `AnswerBuilder.run()`, it is now merged into `GeneratedAnswer` meta.
- Fix `DocumentSplitter` to handle custom `splitting_function` without requiring `split_length`. Previously the `splitting_function` provided would not override other settings.
v2.8.0-rc3
Release Notes
⬆️ Upgrade Notes
- Remove `is_greedy` deprecated argument from `@component` decorator. Change the `Variadic` input of your Component to `GreedyVariadic` instead.
🚀 New Features
- We've added a new `DALLEImageGenerator` component, bringing image generation with OpenAI's DALL-E to Haystack.
  - Easy to Use: Just a few lines of code to get started:

    ```python
    from haystack.components.generators import DALLEImageGenerator

    # By default the API key is read from the OPENAI_API_KEY environment variable.
    image_generator = DALLEImageGenerator()
    response = image_generator.run("Show me a picture of a black cat.")
    print(response)
    ```
- Add warning logs to the `PDFMinerToDocument` and `PyPDFToDocument` converters to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to the `DocumentSplitter` that warns the user that empty Documents are skipped. This behavior was already occurring, but now it's clearer through logs that this is happening.
- We have added a new `MetaFieldGroupingRanker` component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.
- Added a new `store_full_path` parameter to the `__init__` methods of the following converters: `JSONConverter`, `CSVToDocument`, `DOCXToDocument`, `HTMLToDocument`, `MarkdownToDocument`, `PDFMinerToDocument`, `PPTXToDocument`, `TikaDocumentConverter`, `PyPDFToDocument`, `AzureOCRDocumentConverter` and `TextFileToDocument`. The default value is `True`, which stores the full file path in the metadata of the output documents. When set to `False`, only the file name is stored.
- When making function calls via `OpenAPI`, allow both switching SSL verification off and specifying a certificate authority to use for it.
- Add TTFT (Time-to-First-Token) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.
- Added a new option to the `required_variables` parameter of the `PromptBuilder` and `ChatPromptBuilder`. By passing `required_variables="*"` you can automatically set all variables in the prompt to be required.
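The effect of `store_full_path` on document metadata can be pictured with a small standalone sketch. The `file_path_meta` helper and the `file_path` metadata key are assumptions made for illustration; the release notes do not name the key, and this is not the converters' actual code.

```python
from pathlib import PurePosixPath
from typing import Dict


def file_path_meta(source: str, store_full_path: bool = True) -> Dict[str, str]:
    """Return the metadata entry a converter might store for a source file.

    The 'file_path' key is an assumption of this sketch.
    """
    path = PurePosixPath(source)
    return {"file_path": str(path) if store_full_path else path.name}


print(file_path_meta("/home/user/docs/report.pdf", store_full_path=True))
print(file_path_meta("/home/user/docs/report.pdf", store_full_path=False))
```

Storing only the file name keeps directory layouts and user names out of indexed metadata, which is the privacy motivation given for the planned default change.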
⚡️ Enhancement Notes
- Across the Haystack codebase, we have replaced the use of the `ChatMessage` data class constructor with specific class methods (`ChatMessage.from_user`, `ChatMessage.from_assistant`, etc.).
- Added the Maximum Margin Relevance (MMR) strategy to the `SentenceTransformersDiversityRanker`. MMR scores are calculated for each document based on their relevance to the query and diversity from already selected documents.
- Introduces optional parameters in the `ConditionalRouter` component, enabling default/fallback routing behavior when certain inputs are not provided at runtime. This enhancement allows for more flexible pipeline configurations with graceful handling of missing parameters.
- Added split by line to `DocumentSplitter`, which will split the document at `\n`.
- Change `OpenAIDocumentEmbedder` to keep running if a batch fails embedding. Now, when OpenAI returns an error, we log that error and keep processing the following batches.
- Added new initialization parameters to the `PyPDFToDocument` component to customize the text extraction process from PDF files.
- Replace usage of `ChatMessage.content` with `ChatMessage.text` across the codebase. This is done in preparation for the removal of `content` in Haystack 2.9.0.
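The MMR idea mentioned above (relevance to the query, penalized by similarity to documents already selected) can be sketched in plain Python. This follows the common textbook formulation on toy 2-D vectors and is not the ranker's actual code; the `mmr_order` function and its `lambda_weight` parameter are hypothetical names.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_order(query: Sequence[float], docs: List[Sequence[float]],
              lambda_weight: float = 0.5) -> List[int]:
    """Return document indices in greedy MMR order."""
    relevance = [cosine(query, d) for d in docs]
    selected: List[int] = []
    remaining = list(range(len(docs)))
    while remaining:
        def score(i: int) -> float:
            # Redundancy: highest similarity to any already selected document.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_weight * relevance[i] - (1 - lambda_weight) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates
print(mmr_order(query, docs))
```

Ranking by relevance alone would order these documents 0, 1, 2. With the redundancy penalty, document 2 is picked before the near-duplicate document 1.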
⚠️ Deprecation Notes
- The default value of the `store_full_path` parameter in converters will change to `False` in Haystack 2.9.0 to enhance privacy.
- In Haystack 2.9.0, the `ChatMessage` data class will be refactored to make it more flexible and future-proof. As part of this change, the `content` attribute will be removed. A new `text` property has been introduced to provide access to the textual value of the `ChatMessage`. To ensure a smooth transition, start using the `text` property now in place of `content`.
- The `converter` parameter in the `PyPDFToDocument` component is deprecated and will be removed in Haystack 2.9.0. For in-depth customization of the conversion process, consider implementing a custom component. Additional high-level customization options will be added in the future.
- The output of `context_documents` in `SentenceWindowRetriever` will change in the next release. Instead of a `List[List[Document]]`, the output will be a `List[Document]`, where the documents are ordered by `split_idx_start`.
🐛 Bug Fixes
- Fix `DocumentCleaner` not preserving all `Document` fields when run.
- Fix `DocumentJoiner` failing when run with an empty list of Documents.
- For the `NLTKDocumentSplitter`, we are updating how chunks are made when splitting by word while respecting sentence boundaries. Namely, to avoid fully subsuming the previous chunk into the next one, we ignore the first sentence from that chunk when calculating sentence overlap, i.e. we want to avoid cases of `Doc1 = [s1, s2]`, `Doc2 = [s1, s2, s3]`.
- Finished adding function support for this component by updating the `_split_into_units` function and adding the `splitting_function` init parameter.
- Add a specific `to_dict` method to overwrite the underlying one from `DocumentSplitter`. This is needed to properly save the settings of the component to YAML.
- Fix `OpenAIChatGenerator` and `OpenAIGenerator` crashing when using a `streaming_callback` while `generation_kwargs` contains `{"stream_options": {"include_usage": True}}`.
- Fix tracing of a `Pipeline` with cycles to correctly track component execution.
- When `meta` is passed into `AnswerBuilder.run()`, it is now merged into the `GeneratedAnswer` meta.
- Fix `DocumentSplitter` to handle a custom `splitting_function` without requiring `split_length`. Previously, the provided `splitting_function` would not override other settings.
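For readers unfamiliar with the `splitting_function` hook, here is a minimal sketch of the kind of callable it is meant for. It assumes the hook accepts a function that takes a string and returns a list of strings; the function name `split_on_markdown_headings` is hypothetical, and how it is wired into `DocumentSplitter` is left out because the notes do not show it.

```python
import re
from typing import List


def split_on_markdown_headings(text: str) -> List[str]:
    """Split text before every line that starts with '#'.

    Each heading stays attached to the body that follows it.
    Assumed shape for a custom splitting function: str -> List[str].
    """
    # Zero-width split at the start of each heading line (Python 3.7+).
    parts = re.split(r"(?m)^(?=#)", text)
    return [part for part in parts if part.strip()]


sample = "# A\nbody\n# B\nmore\n"
units = split_on_markdown_headings(sample)
print(units)
```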
v2.8.0-rc2
Release Notes
⬆️ Upgrade Notes
- Remove the deprecated `is_greedy` argument from the `@component` decorator. Change the `Variadic` input of your Component to `GreedyVariadic` instead.
🚀 New Features
- We've added a new `DALLEImageGenerator` component, bringing image generation with OpenAI's DALL-E to Haystack.
  - Easy to Use: Just a few lines of code to get started:

    ```python
    from haystack.components.generators import DALLEImageGenerator

    # By default the API key is read from the OPENAI_API_KEY environment variable.
    image_generator = DALLEImageGenerator()
    response = image_generator.run("Show me a picture of a black cat.")
    print(response)
    ```
- Add warning logs to the `PDFMinerToDocument` and `PyPDFToDocument` converters to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to the `DocumentSplitter` that warns the user that empty Documents are skipped. This behavior was already occurring, but now it's clearer through logs that this is happening.
- We have added a new `MetaFieldGroupingRanker` component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.
- Added a new `store_full_path` parameter to the `__init__` methods of the following converters: `JSONConverter`, `CSVToDocument`, `DOCXToDocument`, `HTMLToDocument`, `MarkdownToDocument`, `PDFMinerToDocument`, `PPTXToDocument`, `TikaDocumentConverter`, `PyPDFToDocument`, `AzureOCRDocumentConverter` and `TextFileToDocument`. The default value is `True`, which stores the full file path in the metadata of the output documents. When set to `False`, only the file name is stored.
- When making function calls via `OpenAPI`, allow both switching SSL verification off and specifying a certificate authority to use for it.
- Add TTFT (Time-to-First-Token) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.
- Added a new option to the `required_variables` parameter of the `PromptBuilder` and `ChatPromptBuilder`. By passing `required_variables="*"` you can automatically set all variables in the prompt to be required.
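As a rough picture of what time-to-first-token means, the sketch below times how long a streaming source takes to yield its first chunk. It uses a fake token stream instead of a real OpenAI call; `fake_token_stream` and `consume_with_ttft` are hypothetical names, and the notes do not say how the generators expose the measured value.

```python
import time
from typing import Iterable, Iterator, List, Optional, Tuple


def fake_token_stream(tokens: List[str], delay_s: float) -> Iterator[str]:
    """Simulate a streaming model that needs delay_s seconds per token."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def consume_with_ttft(stream: Iterable[str]) -> Tuple[str, Optional[float]]:
    """Join a token stream and record seconds elapsed until the first token."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    chunks: List[str] = []
    for token in stream:
        if ttft is None:
            # First token just arrived: this is the time-to-first-token.
            ttft = time.perf_counter() - start
        chunks.append(token)
    return "".join(chunks), ttft


text, ttft = consume_with_ttft(fake_token_stream(["Hel", "lo"], delay_s=0.01))
print(text, f"TTFT={ttft:.3f}s")
```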
⚡️ Enhancement Notes
- Across the Haystack codebase, we have replaced the use of the `ChatMessage` data class constructor with specific class methods (`ChatMessage.from_user`, `ChatMessage.from_assistant`, etc.).
- Added the Maximum Margin Relevance (MMR) strategy to the `SentenceTransformersDiversityRanker`. MMR scores are calculated for each document based on their relevance to the query and diversity from already selected documents.
- Introduces optional parameters in the `ConditionalRouter` component, enabling default/fallback routing behavior when certain inputs are not provided at runtime. This enhancement allows for more flexible pipeline configurations with graceful handling of missing parameters.
- Added split by line to `DocumentSplitter`, which will split the document at `\n`.
- Change `OpenAIDocumentEmbedder` to keep running if a batch fails embedding. Now, when OpenAI returns an error, we log that error and keep processing the following batches.
- Added new initialization parameters to the `PyPDFToDocument` component to customize the text extraction process from PDF files.
- Replace usage of `ChatMessage.content` with `ChatMessage.text` across the codebase. This is done in preparation for the removal of `content` in Haystack 2.9.0.
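The "log and keep going" behavior described for `OpenAIDocumentEmbedder` follows a general pattern that can be sketched without any API calls. The `embed_in_batches` helper and the flaky toy embedder below are hypothetical; the sketch simply drops failed batches, and how the real component keeps results aligned with its inputs is not stated in the notes.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 2,
) -> List[List[float]]:
    """Embed texts batch by batch, logging and skipping any batch that fails."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            embeddings.extend(embed_batch(batch))
        except Exception as exc:
            # Log the failure and continue with the next batch.
            logger.warning("Failed embedding batch starting at %d: %s", start, exc)
    return embeddings


def flaky_embedder(batch: Sequence[str]) -> List[List[float]]:
    """Toy embedder: fails on any batch containing 'bad'."""
    if "bad" in batch:
        raise RuntimeError("simulated API error")
    return [[float(len(text))] for text in batch]


result = embed_in_batches(["a", "b", "bad", "c", "d"], flaky_embedder)
print(result)
```

Here the second batch (`"bad"`, `"c"`) fails as a whole, so only the embeddings for `"a"`, `"b"` and `"d"` are returned.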
⚠️ Deprecation Notes
- The default value of the `store_full_path` parameter in converters will change to `False` in Haystack 2.9.0 to enhance privacy.
- In Haystack 2.9.0, the `ChatMessage` data class will be refactored to make it more flexible and future-proof. As part of this change, the `content` attribute will be removed. A new `text` property has been introduced to provide access to the textual value of the `ChatMessage`. To ensure a smooth transition, start using the `text` property now in place of `content`.
- The `converter` parameter in the `PyPDFToDocument` component is deprecated and will be removed in Haystack 2.9.0. For in-depth customization of the conversion process, consider implementing a custom component. Additional high-level customization options will be added in the future.
🐛 Bug Fixes
- Fix `DocumentCleaner` not preserving all `Document` fields when run.
- Fix `DocumentJoiner` failing when run with an empty list of Documents.
- For the `NLTKDocumentSplitter`, we are updating how chunks are made when splitting by word while respecting sentence boundaries. Namely, to avoid fully subsuming the previous chunk into the next one, we ignore the first sentence from that chunk when calculating sentence overlap, i.e. we want to avoid cases of `Doc1 = [s1, s2]`, `Doc2 = [s1, s2, s3]`.
- Finished adding function support for this component by updating the `_split_into_units` function and adding the `splitting_function` init parameter.
- Add a specific `to_dict` method to overwrite the underlying one from `DocumentSplitter`. This is needed to properly save the settings of the component to YAML.
- Fix `OpenAIChatGenerator` and `OpenAIGenerator` crashing when using a `streaming_callback` while `generation_kwargs` contains `{"stream_options": {"include_usage": True}}`.
- Fix tracing of a `Pipeline` with cycles to correctly track component execution.
- When `meta` is passed into `AnswerBuilder.run()`, it is now merged into the `GeneratedAnswer` meta.
- Fix `DocumentSplitter` to handle a custom `splitting_function` without requiring `split_length`. Previously, the provided `splitting_function` would not override other settings.